-
Penalized FCI for Causal Structure Learning in a Sparse DAG for Biomarker Discovery in Parkinson's Disease
Authors:
Samhita Pal,
Dhrubajyoti Ghosh,
Shu Yang
Abstract:
Parkinson's disease (PD) is a progressive neurodegenerative disorder that lacks reliable early-stage biomarkers for diagnosis, prognosis, and therapeutic monitoring. While cerebrospinal fluid (CSF) biomarkers, such as alpha-synuclein seed amplification assays (alphaSyn-SAA), offer diagnostic potential, their clinical utility is limited by invasiveness and incomplete specificity. Plasma biomarkers provide a minimally invasive alternative, but their mechanistic role in PD remains unclear. A major challenge is distinguishing whether plasma biomarkers causally reflect primary neurodegenerative processes or are downstream consequences of disease progression. To address this, we leverage the Parkinson's Progression Markers Initiative (PPMI) Project 9000, containing 2,924 plasma and CSF biomarkers, to systematically infer causal relationships with disease status. However, only a sparse subset of these biomarkers and their interconnections are actually relevant for the disease. Existing causal discovery algorithms, such as Fast Causal Inference (FCI) and its variants, struggle with the high dimensionality of biomarker datasets under sparsity, limiting their scalability. We propose Penalized Fast Causal Inference (PFCI), a novel approach that incorporates sparsity constraints to efficiently infer causal structures in large-scale biological datasets. By applying PFCI to PPMI data, we aim to identify biomarkers that are causally linked to PD pathology, enabling early diagnosis and patient stratification. Our findings will facilitate biomarker-driven clinical trials and contribute to the development of neuroprotective therapies.
Submitted 30 June, 2025;
originally announced July 2025.
-
A Spectral Confounder Adjustment for Spatial Regression with Multiple Exposures and Outcomes
Authors:
Shih-Ni Prim,
Yawen Guan,
Shu Yang,
Ana G Rappold,
K. Lloyd Hill,
Wei-Lun Tsai,
Corinna Keeler,
Brian J Reich
Abstract:
Unmeasured spatial confounding complicates exposure effect estimation in environmental health studies. This problem is exacerbated in studies with multiple health outcomes and environmental exposure variables, as the source and magnitude of confounding bias may differ across exposure/outcome pairs. We propose to mitigate the effects of spatial confounding in multivariate studies by projecting to the spectral domain to separate relationships by the spatial scale and assuming that the confounding bias dissipates at more local scales. Under this assumption and some reasonable conditions, the random effect is uncorrelated with the exposures in local scales, ensuring causal interpretation of the regression coefficients. Our model for the exposure effects is a three-way tensor over exposure, outcome, and spatial scale. We use a canonical polyadic decomposition and shrinkage priors to encourage sparsity and borrow strength across the dimensions of the tensor. We demonstrate the performance of our method in an extensive simulation study and data analysis to understand the relationship between disaster resilience and the incidence of chronic diseases.
Submitted 10 June, 2025;
originally announced June 2025.
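As a reading aid, the canonical polyadic decomposition mentioned in the abstract writes a three-way tensor as a sum of rank-one terms. The sketch below is generic; the symbols $\beta_{jkl}$ (effect of exposure $j$ on outcome $k$ at spatial scale $l$), the rank $R$, and the factors $a$, $b$, $c$ are notational assumptions, not the paper's own notation.

$$
\beta_{jkl} = \sum_{r=1}^{R} a_{jr}\, b_{kr}\, c_{lr}
$$

Shrinkage priors placed on the factors can then encourage sparsity and share information across exposures, outcomes, and scales, which is the role the abstract describes.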
-
Lower Ricci Curvature for Hypergraphs
Authors:
Shiyi Yang,
Can Chen,
Didong Li
Abstract:
Networks with higher-order interactions, prevalent in biological, social, and information systems, are naturally represented as hypergraphs, yet their structural complexity poses fundamental challenges for geometric characterization. While curvature-based methods offer powerful insights in graph analysis, existing extensions to hypergraphs suffer from critical trade-offs: combinatorial approaches such as Forman-Ricci curvature capture only coarse features, whereas geometric methods like Ollivier-Ricci curvature offer richer expressivity but demand costly optimal transport computations. To address these challenges, we introduce hypergraph lower Ricci curvature (HLRC), a novel curvature metric defined in closed form that achieves a principled balance between interpretability and efficiency. Evaluated across diverse synthetic and real-world hypergraph datasets, HLRC consistently reveals meaningful higher-order organization, distinguishing intra- from inter-community hyperedges, uncovering latent semantic labels, tracking temporal dynamics, and supporting robust clustering of hypergraphs based on global structure. By unifying geometric sensitivity with algorithmic simplicity, HLRC provides a versatile foundation for hypergraph analytics, with broad implications for tasks including node classification, anomaly detection, and generative modeling in complex systems.
Submitted 4 June, 2025;
originally announced June 2025.
-
Are Statistical Methods Obsolete in the Era of Deep Learning?
Authors:
Skyler Wu,
Shihao Yang,
S. C. Kou
Abstract:
In the era of AI, neural networks have become increasingly popular for modeling, inference, and prediction, largely due to their potential for universal approximation. With the proliferation of such deep learning models, a question arises: are leaner statistical methods still relevant? To shed light on this question, we employ the mechanistic nonlinear ordinary differential equation (ODE) inverse problem as a testbed, using the physics-informed neural network (PINN) as a representative of the deep learning paradigm and manifold-constrained Gaussian process inference (MAGI) as a representative of statistically principled methods. Through case studies involving the SEIR model from epidemiology and the Lorenz model from chaotic dynamics, we demonstrate that statistical methods are far from obsolete, especially when working with sparse and noisy observations. On tasks such as parameter inference and trajectory reconstruction, statistically principled methods consistently achieve lower bias and variance, while using far fewer parameters and requiring less hyperparameter tuning. Statistical methods can also decisively outperform deep learning models on out-of-sample future prediction, where the absence of relevant data often leads overparameterized models astray. Additionally, we find that statistically principled approaches are more robust to the accumulation of numerical imprecision and can represent the underlying system in a way that is more faithful to the true governing ODEs.
Submitted 27 May, 2025;
originally announced May 2025.
-
Discounted Online Convex Optimization: Uniform Regret Across a Continuous Interval
Authors:
Wenhao Yang,
Sifan Yang,
Lijun Zhang
Abstract:
Reflecting the greater significance of recent history over the distant past in non-stationary environments, $λ$-discounted regret has been introduced in online convex optimization (OCO) to gracefully forget past data as new information arrives. When the discount factor $λ$ is given, online gradient descent with an appropriate step size achieves an $O(1/\sqrt{1-λ})$ discounted regret. However, the value of $λ$ is often not predetermined in real-world scenarios. This gives rise to a significant open question: is it possible to develop a discounted algorithm that adapts to an unknown discount factor? In this paper, we affirmatively answer this question by providing a novel analysis to demonstrate that smoothed OGD (SOGD) achieves a uniform $O(\sqrt{\log T/(1-λ)})$ discounted regret, holding for all values of $λ$ across a continuous interval simultaneously. The basic idea is to maintain multiple OGD instances to handle different discount factors, and aggregate their outputs sequentially by an online prediction algorithm named Discounted-Normal-Predictor (DNP) (Kapralov and Panigrahy, 2010). Our analysis reveals that DNP can combine the decisions of two experts, even when they operate on discounted regret with different discount factors.
Submitted 26 May, 2025;
originally announced May 2025.
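For reference, the $λ$-discounted regret against a fixed comparator and the projected OGD step mentioned in the abstract are commonly written as below. The symbols $f_t$, $x_t$, $u$, $\mathcal{X}$, $\eta$, and $\Pi_{\mathcal{X}}$ are notational assumptions that do not appear in the listing, and the paper's own definitions may differ in detail.

$$
\mathrm{Regret}_{\lambda}(T) = \sum_{t=1}^{T} \lambda^{T-t}\bigl(f_t(x_t) - f_t(u)\bigr),
\qquad
x_{t+1} = \Pi_{\mathcal{X}}\bigl[x_t - \eta\,\nabla f_t(x_t)\bigr]
$$

The weight $\lambda^{T-t}$ shrinks geometrically with the age of round $t$, which is how older losses are gradually forgotten.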
-
Advanced Crash Causation Analysis for Freeway Safety: A Large Language Model Approach to Identifying Key Contributing Factors
Authors:
Ahmed S. Abdelrahman,
Mohamed Abdel-Aty,
Samgyu Yang,
Abdulrahman Faden
Abstract:
Understanding the factors contributing to traffic crashes and developing strategies to mitigate their severity is essential. Traditional statistical methods and machine learning models often struggle to capture the complex interactions between various factors and the unique characteristics of each crash. This research leverages a large language model (LLM) to analyze freeway crash data and provide crash causation analysis accordingly. By compiling 226 traffic safety studies related to freeway crashes, a training dataset encompassing environmental, driver, traffic, and geometric design factors was created. The Llama3 8B model was fine-tuned using QLoRA to enhance its understanding of freeway crashes and their contributing factors, as covered in these studies. The fine-tuned Llama3 8B model was then used to identify crash causation without pre-labeled data through zero-shot classification, providing comprehensive explanations to ensure that the identified causes were reasonable and aligned with existing research. Results demonstrate that LLMs effectively identify primary crash causes such as alcohol-impaired driving, speeding, aggressive driving, and driver inattention. Incorporating event data, such as road maintenance, offers more profound insights. The model's practical applicability and potential to improve traffic safety measures were validated by a high level of agreement (88.89%) among traffic safety researchers, as reflected in questionnaire results. This research highlights the complex nature of traffic crashes and how LLMs can be used for comprehensive analysis of crash causation and other contributing factors. Moreover, it provides valuable insights and potential countermeasures to aid planners and policymakers in developing more effective and efficient traffic safety practices.
Submitted 15 May, 2025;
originally announced May 2025.
-
Doubly Robust Fusion of Many Treatments for Policy Learning
Authors:
Ke Zhu,
Jianing Chu,
Ilya Lipkovich,
Wenyu Ye,
Shu Yang
Abstract:
Individualized treatment rules/recommendations (ITRs) aim to improve patient outcomes by tailoring treatments to the characteristics of each individual. However, when there are many treatment groups, existing methods face significant challenges due to data sparsity within treatment groups and highly unbalanced covariate distributions across groups. To address these challenges, we propose a novel calibration-weighted treatment fusion procedure that robustly balances covariates across treatment groups and fuses similar treatments using a penalized working model. The fusion procedure ensures the recovery of latent treatment group structures when either the calibration model or the outcome model is correctly specified. In the fused treatment space, practitioners can seamlessly apply state-of-the-art ITR learning methods with the flexibility to utilize a subset of covariates, thereby achieving robustness while addressing practical concerns such as fairness. We establish theoretical guarantees, including consistency, the oracle property of treatment fusion, and regret bounds when integrated with multi-armed ITR learning methods such as policy trees. Simulation studies show superior group recovery and policy value compared to existing approaches. We illustrate the practical utility of our method using a nationwide electronic health record-derived de-identified database containing data from patients with Chronic Lymphocytic Leukemia and Small Lymphocytic Lymphoma.
Submitted 23 May, 2025; v1 submitted 12 May, 2025;
originally announced May 2025.
-
Attractor-Based Coevolving Dot Product Random Graph Model
Authors:
Shiwen Yang,
Daniel L. Sussman
Abstract:
We introduce the attractor-based coevolving dot product random graph model (ABCDPRGM) to analyze time-series network data manifesting polarizing or flocking behavior. Graphs are generated based on latent positions under the random dot product graph regime. We assign group membership to each node. When evolving through time, the latent position of each node changes based on its current position and two attractors, defined as the centers of the latent positions of its neighbors that share its group membership and of its neighbors that belong to a different group, respectively. Parameters are assigned to the attractors to quantify the amount of influence that the attractors have on the trajectory of the latent position of each node. We develop estimators for the parameters, demonstrate their consistency, and establish convergence rates under specific assumptions. Through the ABCDPRGM, we provide a novel framework for quantifying and understanding the underlying forces influencing the polarizing or flocking behaviors in dynamic network data.
Submitted 5 May, 2025;
originally announced May 2025.
-
Robust Estimation and Inference in Hybrid Controlled Trials for Binary Outcomes: A Case Study on Non-Small Cell Lung Cancer
Authors:
Jiajun Liu,
Ke Zhu,
Shu Yang,
Xiaofei Wang
Abstract:
Hybrid controlled trials (HCTs), which augment randomized controlled trials (RCTs) with external controls (ECs), are increasingly receiving attention as a way to address limited power, slow accrual, and ethical concerns in clinical research. However, borrowing from ECs raises critical statistical challenges in estimation and inference, especially for binary outcomes where hidden bias is harder to detect and estimands such as risk difference, risk ratio, and odds ratio are of primary interest. We propose a novel framework that combines doubly robust estimators for various estimands under covariate shift of ECs with conformal selective borrowing (CSB) to address outcome incomparability. CSB uses conformal inference with nearest-neighbor-based conformal scores and their label-conditional extensions to perform finite-sample exact individual-level EC selection, addressing the limited information in binary outcomes. To ensure strict type I error rate control for testing treatment effects while gaining power, we use a Fisher randomization test with the CSB estimator as the test statistic. Extensive simulations demonstrate the robust performance of our methods. We apply our method to data from CALGB 9633 and the National Cancer Database to evaluate chemotherapy effects in Stage IB non-small-cell lung cancer patients and show that the proposed method effectively mitigates hidden bias introduced by full-borrowing approaches, strictly controls the type I error rate, and improves the power over RCT-only analysis.
Submitted 30 April, 2025;
originally announced May 2025.
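The three estimands named in the abstract (risk difference, risk ratio, odds ratio) can be illustrated with plain, unadjusted plug-in formulas. This is only a minimal sketch of the definitions: it is not the paper's doubly robust or conformal selective borrowing estimator, the `BinaryArm` class and function names are hypothetical, and the counts in the usage example are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryArm:
    """Event counts for one arm (hypothetical container, for illustration only)."""
    events: int
    total: int

    @property
    def risk(self) -> float:
        # Observed proportion of subjects with the event.
        return self.events / self.total


def risk_difference(treated: BinaryArm, control: BinaryArm) -> float:
    """Difference of event proportions, treated minus control."""
    return treated.risk - control.risk


def risk_ratio(treated: BinaryArm, control: BinaryArm) -> float:
    """Ratio of event proportions, treated over control."""
    return treated.risk / control.risk


def odds_ratio(treated: BinaryArm, control: BinaryArm) -> float:
    """Ratio of odds p / (1 - p), treated over control."""
    odds_treated = treated.risk / (1.0 - treated.risk)
    odds_control = control.risk / (1.0 - control.risk)
    return odds_treated / odds_control


if __name__ == "__main__":
    # Illustrative counts only; not data from the paper.
    treated = BinaryArm(events=30, total=100)
    control = BinaryArm(events=20, total=100)
    print(risk_difference(treated, control))
    print(risk_ratio(treated, control))
    print(odds_ratio(treated, control))
```

These unadjusted versions ignore covariate shift and outcome incomparability between trial and external controls, which are the issues the abstract's method is designed to handle.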
-
Collaborative Inference for Sparse High-Dimensional Models with Non-Shared Data
Authors:
Yifan Gu,
Hanfang Yang,
Songshan Yang,
Hui Zou
Abstract:
In modern data analysis, statistical efficiency improvement is expected via effective collaboration among multiple data holders with non-shared data. In this article, we propose a collaborative score-type test (CST) for testing linear hypotheses, which accommodates potentially high-dimensional nuisance parameters and a diverging number of constraints and target parameters. Through a careful decomposition of the Kiefer-Bahadur representation for the traditional score statistic, we identify and approximate the key components using aggregated local gradient information from each data source. In addition, we employ a two-stage partial penalization strategy to shrink the approximation error and mitigate the bias from the high-dimensional nuisance parameters. Unlike existing methods, the CST procedure involves constrained optimization under non-shared and high-dimensional data settings, which requires novel theoretical developments. We derive the limiting distributions for the CST statistic under the null hypothesis and the local alternatives. Besides, the CST exhibits an oracle property and achieves the global statistical efficiency. Moreover, it relaxes the stringent restrictions on the number of data sources required in the current literature. Extensive numerical studies and a real example demonstrate the effectiveness and validity of our proposed method.
Submitted 28 April, 2025; v1 submitted 28 April, 2025;
originally announced April 2025.
-
Physics-Informed Inference Time Scaling via Simulation-Calibrated Scientific Machine Learning
Authors:
Zexi Fan,
Yan Sun,
Shihao Yang,
Yiping Lu
Abstract:
High-dimensional partial differential equations (PDEs) pose significant computational challenges across fields ranging from quantum chemistry to economics and finance. Although scientific machine learning (SciML) techniques offer approximate solutions, they often suffer from bias and neglect crucial physical insights. Inspired by inference-time scaling strategies in language models, we propose Simulation-Calibrated Scientific Machine Learning (SCaSML), a physics-informed framework that dynamically refines and debiases the SciML predictions during inference by enforcing the physical laws. SCaSML leverages newly derived physical laws that quantify systematic errors and employs Monte Carlo solvers based on the Feynman-Kac and Elworthy-Bismut-Li formulas to dynamically correct the prediction. Both numerical and theoretical analyses confirm enhanced convergence rates via compute-optimal inference methods. Our numerical experiments demonstrate that SCaSML reduces errors by 20-50% compared to the base surrogate model, establishing it as the first algorithm to refine approximated solutions to high-dimensional PDEs during inference. Code of SCaSML is available at https://github.com/Francis-Fan-create/SCaSML.
Submitted 25 April, 2025; v1 submitted 22 April, 2025;
originally announced April 2025.
-
ColonScopeX: Leveraging Explainable Expert Systems with Multimodal Data for Improved Early Diagnosis of Colorectal Cancer
Authors:
Natalia Sikora,
Robert L. Manschke,
Alethea M. Tang,
Peter Dunstan,
Dean A. Harris,
Su Yang
Abstract:
Colorectal cancer (CRC) ranks as the second leading cause of cancer-related deaths and the third most prevalent malignant tumour worldwide. Early detection of CRC remains problematic due to its non-specific and often embarrassing symptoms, which patients frequently overlook or hesitate to report to clinicians. Crucially, the stage at which CRC is diagnosed significantly impacts survivability, with a survival rate of 80-95% for Stage I and a stark decline to 10% for Stage IV. Unfortunately, in the UK, only 14.4% of cases are diagnosed at the earliest stage (Stage I).
In this study, we propose ColonScopeX, a machine learning framework utilizing explainable AI (XAI) methodologies to enhance the early detection of CRC and pre-cancerous lesions. Our approach employs a multimodal model that integrates signals from blood sample measurements, processed using the Savitzky-Golay algorithm for fingerprint smoothing, alongside comprehensive patient metadata, including medication history, comorbidities, age, weight, and BMI. By leveraging XAI techniques, we aim to render the model's decision-making process transparent and interpretable, thereby fostering greater trust and understanding in its predictions. The proposed framework could be utilised as a triage tool or as a screening tool for the general population.
This research highlights the potential of combining diverse patient data sources and explainable machine learning to tackle critical challenges in medical diagnostics.
Submitted 9 April, 2025;
originally announced April 2025.
-
Restoring the Forecasting Power of Google Trends with Statistical Preprocessing
Authors:
Candice Djorno,
Mauricio Santillana,
Shihao Yang
Abstract:
Google Trends reports how frequently specific queries are searched on Google over time. It is widely used in research and industry to gain early insights into public interest. However, its data generation mechanism introduces missing values, sampling variability, noise, and trends. These issues arise from privacy thresholds mapping low search volumes to zeros, daily sampling variations causing discrepancies across historical downloads, and algorithm updates altering volume magnitudes over time. Data quality has recently deteriorated, with more zeros and noise, even for previously stable queries. We propose a comprehensive statistical methodology to preprocess Google Trends search information using hierarchical clustering, smoothing splines, and detrending. We validate our approach by forecasting U.S. influenza hospitalizations with a univariate ARIMAX model. Compared to omitting exogenous variables, our results show that raw Google Trends data degrades modeling performance, while preprocessed signals enhance forecast accuracy by 58% nationally and 24% at the state level.
Submitted 9 April, 2025;
originally announced April 2025.
-
Integrative Analysis of High-dimensional RCT and RWD Subject to Censoring and Hidden Confounding
Authors:
Xin Ye,
Shu Yang,
Xiaofei Wang,
Yanyan Liu
Abstract:
In this study, we focus on estimating the heterogeneous treatment effect (HTE) for survival outcomes. The outcome is subject to censoring and the covariates are high-dimensional. We utilize data from both the randomized controlled trial (RCT), considered as the gold standard, and real-world data (RWD), possibly affected by hidden confounding factors. To achieve a more efficient HTE estimate, such integrative analysis requires great insight into the data generation mechanism, particularly the accurate characterization of unmeasured confounding effects/bias. With this aim, we propose a penalized-regression-based integrative approach that allows for the simultaneous estimation of parameters, selection of variables, and identification of the existence of unmeasured confounding effects. The consistency, asymptotic normality, and efficiency gains are rigorously established for the proposed estimate.
Finally, we apply the proposed method to estimate the HTE of lobar/sublobar resection on the survival of lung cancer patients. The RCT is a multicenter non-inferiority randomized phase 3 trial, and the RWD comes from a clinical oncology cancer registry in the United States. The analysis reveals that unmeasured confounding exists and that the integrative approach enhances the efficiency of HTE estimation.
Submitted 20 March, 2025;
originally announced March 2025.
-
Statistical Inference for Heterogeneous Treatment Effect with Right-censored Data from Synthesizing Randomized Clinical Trials and Real-world Data
Authors:
Guangcai Mao,
Shu Yang,
Xiaofei Wang
Abstract:
The heterogeneous treatment effect plays a crucial role in precision medicine. There is evidence that real-world data, even subject to biases, can be employed as supplementary evidence for randomized clinical trials to improve the statistical efficiency of the heterogeneous treatment effect estimation. In this paper, for survival data with right censoring, we consider estimating the heterogeneous…
▽ More
The heterogeneous treatment effect plays a crucial role in precision medicine. There is evidence that real-world data, even subject to biases, can be employed as supplementary evidence for randomized clinical trials to improve the statistical efficiency of the heterogeneous treatment effect estimation. In this paper, for survival data with right censoring, we consider estimating the heterogeneous treatment effect, defined as the difference of the treatment-specific conditional restricted mean survival times given covariates, by synthesizing evidence from randomized clinical trials and the real-world data with possible biases. We define an omnibus confounding function to characterize the effect of biases caused by unmeasured confounders, censoring, outcome heterogeneity, and measurement error, and further, identify it by combining the trial and real-world data. We propose a penalized sieve method to estimate the heterogeneous treatment effect and the confounding function and further study the theoretical properties of the proposed integrative estimators based on the theory of reproducing kernel Hilbert space and empirical process. The proposed methodology is shown to outperform the approach solely based on the trial data through simulation studies and an integrative analysis of the data from a randomized trial and a real-world registry on early-stage non-small-cell lung cancer.
Submitted 19 March, 2025;
originally announced March 2025.
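The estimand described in the abstract, a difference of treatment-specific conditional restricted mean survival times, can be written out as follows. The notation is an assumption made for illustration: $T(a)$ is the potential survival time under treatment $a$, $X$ the covariates, $L$ the restriction time, and $S_a(t \mid x)$ the conditional survival function; the paper may use different symbols.

$$
\tau(x) = E\bigl[\min\{T(1), L\} \mid X = x\bigr] - E\bigl[\min\{T(0), L\} \mid X = x\bigr],
\qquad
E\bigl[\min\{T(a), L\} \mid X = x\bigr] = \int_{0}^{L} S_a(t \mid x)\, dt
$$

The second identity is the standard representation of a restricted mean as the area under the survival curve up to $L$.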
-
Doubly robust omnibus sensitivity analysis of externally controlled trials with intercurrent events
Authors:
Chenyin Gao,
Xiang Zhang,
Shu Yang
Abstract:
Externally controlled trials are crucial in clinical development when randomized controlled trials are unethical or impractical. These trials consist of a full treatment arm with the experimental treatment and a full external control arm. However, they present significant challenges in learning the treatment effect due to the lack of randomization and a parallel control group. Besides baseline incomparability, outcome mean non-exchangeability, caused by differences in conditional outcome distributions between external controls and counterfactual concurrent controls, is infeasible to test and may introduce biases in evaluating the treatment effect. Sensitivity analysis of outcome mean non-exchangeability is thus critically important to assess the robustness of the study's conclusions against such assumption violations. Moreover, intercurrent events, which are ubiquitous and inevitable in clinical studies, can further confound the treatment effect and hinder the interpretation of the estimated treatment effects. This paper establishes a semi-parametric framework for externally controlled trials with intercurrent events, offering doubly robust and locally optimal estimators for primary and sensitivity analyses. We develop an omnibus sensitivity analysis that accounts for both outcome mean non-exchangeability and the impacts of intercurrent events simultaneously, ensuring root-n consistency and asymptotic normality under specified conditions. The performance of the proposed sensitivity analysis is evaluated in simulation studies and a real-data problem.
Submitted 9 March, 2025;
originally announced March 2025.
-
Predicting Long-term Urban Overheating and Their Mitigations from Nature Based Solutions Using Machine Learning and Field Measurements
Authors:
Jiwei Zou,
Lin Wang,
Senwen Yang,
Michael Lacasse,
Liangzhu Wang
Abstract:
Urban overheating, exacerbated by climate change, threatens public health and urban sustainability. Traditional approaches, such as numerical simulations and field measurements, face challenges due to uncertainties in input data. This study integrates field measurements with machine learning models to predict the duration and severity of future urban overheating events, focusing on the role of urban greening under different global warming (GW) scenarios. Field measurements were conducted in summer 2024 at an office campus in Ottawa, a cold-climate city. Microclimate data were collected from four locations with varying levels of greenery: a large lawn without trees (Lawn), a parking lot without greenery (Parking), an area with sparsely distributed trees (Tree), and a fully covered forested area (Forest). Machine learning models, including Artificial Neural Networks (ANN), Recurrent Neural Networks (RNN), and Long Short-Term Memory (LSTM) networks, were trained on local microclimate data, with LSTM achieving the best predictions. Four GW scenarios were analyzed, corresponding to different Shared Socioeconomic Pathways (SSP) for 2050 and 2090. Results show that the Universal Thermal Climate Index (UTCI) at the "Parking" location rises from about 27 °C under GW1.0 to 31 °C under GW3.5. Moreover, low health risk conditions (UTCI > 26 °C) increase across all locations due to climate change, regardless of greenery levels. However, tree-covered areas such as "Tree" and "Forest" effectively prevent extreme heat conditions (UTCI > 38.9 °C). These findings highlight the crucial role of urban greening in mitigating severe thermal stress and enhancing thermal comfort under future climate scenarios.
Submitted 25 February, 2025;
originally announced February 2025.
-
Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting
Authors:
Jiecheng Lu,
Shihao Yang
Abstract:
Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to SOTA TSF models.
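The claim that a single linear attention layer behaves like a dynamic VAR can be illustrated with a small numerical sketch. It assumes one common formulation of unnormalized causal linear attention (recurrent state S_t = S_{t-1} + v_t k_t^T, output o_t = S_t q_t); this is not necessarily the exact parameterization used in the paper, and all array names and sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # sequence length and feature dimension (arbitrary choices)
Q = rng.normal(size=(T, d))      # queries q_t
K = rng.normal(size=(T, d))      # keys k_i
V = rng.normal(size=(T, d))      # values v_i, playing the role of lagged observations

# Recurrent form of unnormalized causal linear attention:
#   S_t = S_{t-1} + v_t k_t^T,   o_t = S_t q_t
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S += np.outer(V[t], K[t])
    out_recurrent[t] = S @ Q[t]

# VAR-style form: o_t = sum_{i <= t} A[t, i] * v_i, with input-dependent
# (hence "dynamic") lag weights A[t, i] = q_t . k_i
A = np.tril(Q @ K.T)
out_var = A @ V

# Both forms produce the same outputs
assert np.allclose(out_recurrent, out_var)
```

Under this formulation each lag weight is a scalar, so the output at step t is a weighted sum of current and past values whose coefficients change with the input.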
Submitted 10 February, 2025;
originally announced February 2025.
-
Targeted Data Fusion for Causal Survival Analysis Under Distribution Shift
Authors:
Yi Liu,
Alexander W. Levis,
Ke Zhu,
Shu Yang,
Peter B. Gilbert,
Larry Han
Abstract:
Causal inference across multiple data sources offers a promising avenue to enhance the generalizability and replicability of scientific findings. However, data integration methods for time-to-event outcomes, common in biomedical research, are underdeveloped. Existing approaches focus on binary or continuous outcomes but fail to address the unique challenges of survival analysis, such as censoring and the integration of discrete and continuous time. To bridge this gap, we propose two novel methods for estimating target site-specific causal effects in multi-source settings. First, we develop a semiparametric efficient estimator for settings where individual-level data can be shared across sites. Second, we introduce a federated learning framework designed for privacy-constrained environments, which dynamically reweights source-specific contributions to account for discrepancies with the target population. Both methods leverage flexible, nonparametric machine learning models to improve robustness and efficiency. We illustrate the utility of our approaches through simulation studies and an application to multi-site randomized trials of monoclonal neutralizing antibodies for HIV-1 prevention, conducted among cisgender men and transgender persons in the United States, Brazil, Peru, and Switzerland, as well as among women in sub-Saharan Africa. Our findings underscore the potential of these methods to enable efficient, privacy-preserving causal inference for time-to-event outcomes under distribution shift.
Submitted 14 May, 2025; v1 submitted 30 January, 2025;
originally announced January 2025.
-
Estimating the Causal Effect of Redlining on Present-day Air Pollution
Authors:
Xiaodan Zhou,
Shu Yang,
Brian J Reich
Abstract:
Recent studies have shown associations between redlining policies (1935-1974) and present-day fine particulate matter (PM$_{2.5}$) and nitrogen dioxide (NO$_2$) air pollution concentrations. In this paper, we reevaluate these associations using spatial causal inference. Redlining policies were enacted in the 1930s, so there is very limited documentation of pre-treatment covariates. Consequently, traditional methods fail to sufficiently account for unmeasured confounders, potentially biasing causal interpretations. By integrating historical redlining data with 2010 PM$_{2.5}$ and NO$_2$ concentrations, our study aims to discern whether a causal link exists. Our study addresses challenges with a novel spatial and non-spatial latent factor framework, using the unemployment rate, house rent and percentage of Black population in the 1940 U.S. Census as proxies to reconstruct pre-treatment latent socio-economic status. We establish identification of a causal effect under broad assumptions, and use Bayesian Markov Chain Monte Carlo to quantify uncertainty. Our analysis indicates that historically redlined neighborhoods are exposed to notably higher NO$_2$ concentration. In contrast, the disparities in PM$_{2.5}$ between these neighborhoods are less pronounced. Among the cities analyzed, Los Angeles, CA, and Atlanta, GA, demonstrate the most significant effects for both NO$_2$ and PM$_{2.5}$.
Submitted 14 March, 2025; v1 submitted 28 January, 2025;
originally announced January 2025.
-
EFiGP: Eigen-Fourier Physics-Informed Gaussian Process for Inference of Dynamic Systems
Authors:
Jianhong Chen,
Shihao Yang
Abstract:
Parameter estimation and trajectory reconstruction for data-driven dynamical systems governed by ordinary differential equations (ODEs) are essential tasks in fields such as biology, engineering, and physics. These inverse problems -- estimating ODE parameters from observational data -- are particularly challenging when the data are noisy and sparse and the dynamics are nonlinear. We propose the Eigen-Fourier Physics-Informed Gaussian Process (EFiGP), an algorithm that integrates Fourier transformation and eigen-decomposition into a physics-informed Gaussian Process framework. This approach eliminates the need for numerical integration, significantly enhancing computational efficiency and accuracy. Built on a principled Bayesian framework, EFiGP incorporates the ODE system through probabilistic conditioning, enforcing governing equations in the Fourier domain while truncating high-frequency terms to achieve denoising and computational savings. The use of eigen-decomposition further simplifies Gaussian Process covariance operations, enabling efficient recovery of trajectories and parameters even in dense-grid settings. We validate the practical effectiveness of EFiGP on three benchmark examples, demonstrating its potential for reliable and interpretable modeling of complex dynamical systems while addressing key challenges in trajectory recovery and computational cost.
Submitted 23 January, 2025;
originally announced January 2025.
-
COADVISE: Covariate Adjustment with Variable Selection in Randomized Controlled Trials
Authors:
Yi Liu,
Ke Zhu,
Larry Han,
Shu Yang
Abstract:
Adjusting for covariates in randomized controlled trials can enhance the credibility and efficiency of treatment effect estimation. However, handling numerous covariates and their complex (non-linear) transformations poses a challenge. Motivated by the case study of the Best Apnea Interventions for Research (BestAIR) trial data from the National Sleep Research Resource (NSRR), where the number of covariates (p=114) is comparable to the sample size (N=196), we propose a principled Covariate Adjustment with Variable Selection (COADVISE) framework. COADVISE enables variable selection for covariates most relevant to the outcome while accommodating both linear and nonlinear adjustments. This framework ensures consistent estimates with improved efficiency over unadjusted estimators and provides robust variance estimation, even under outcome model misspecification. We demonstrate efficiency gains through theoretical analysis, extensive simulations, and a re-analysis of the BestAIR trial data to compare alternative variable selection strategies, offering cautionary recommendations. A user-friendly R package, Coadvise, is available to facilitate practical implementation.
Submitted 26 February, 2025; v1 submitted 15 January, 2025;
originally announced January 2025.
-
Transfer Learning for Individualized Treatment Rules: Application to Sepsis Patients Data from eICU-CRD and MIMIC-III Databases
Authors:
Andong Wang,
Kelly Wentzlof,
Johnny Rajala,
Miontranese Green,
Yunshu Zhang,
Shu Yang
Abstract:
Modern precision medicine aims to utilize real-world data to provide the best treatment for an individual patient. An individualized treatment rule (ITR) maps each patient's characteristics to a recommended treatment scheme that maximizes the expected outcome of the patient. A challenge precision medicine faces is population heterogeneity, as studies on treatment effects are often conducted on source populations that differ from the populations of interest in terms of the distribution of patient characteristics. Our research goal is to explore a transfer learning algorithm that aims to address the population heterogeneity problem and obtain targeted, optimal, and interpretable ITRs. The algorithm incorporates a calibrated augmented inverse probability weighting (CAIPW) estimator for the average treatment effect (ATE) and employs value function maximization for the target population using a genetic algorithm (GA) to produce our desired ITR. To demonstrate its practical utility, we apply this transfer learning algorithm to two large medical databases, the Electronic Intensive Care Unit Collaborative Research Database (eICU-CRD) and the Medical Information Mart for Intensive Care III (MIMIC-III). We first identify the important covariates, treatment options, and outcomes of interest based on the two databases, and then estimate the optimal linear ITRs for patients with sepsis. Our research introduces and applies new techniques for data fusion to obtain data-driven ITRs that cater to patients' individual medical needs in a population of interest. By emphasizing generalizability and personalized decision-making, this methodology extends its potential application beyond medicine to fields such as marketing, technology, social sciences, and education.
Submitted 3 January, 2025;
originally announced January 2025.
-
Cost-aware Portfolios in a Large Universe of Assets
Authors:
Qingliang Fan,
Marcelo C. Medeiros,
Hanming Yang,
Songshan Yang
Abstract:
This paper considers the finite horizon portfolio rebalancing problem in terms of mean-variance optimization, where decisions are made based on current information on asset returns and transaction costs. The study's novelty is that the transaction costs are integrated within the optimization problem in a high-dimensional portfolio setting where the number of assets is larger than the sample size. We propose portfolio construction and rebalancing models with a nonconvex penalty that consider two types of transaction cost: the proportional transaction cost and the quadratic transaction cost. We establish the desired theoretical properties under mild regularity conditions. Monte Carlo simulations and empirical studies using S&P 500 and Russell 2000 stocks show the satisfactory performance of the proposed portfolio and highlight the importance of involving the transaction costs when rebalancing a portfolio.
Submitted 16 December, 2024;
originally announced December 2024.
-
Integrating Dual Prototypes for Task-Wise Adaption in Pre-Trained Model-Based Class-Incremental Learning
Authors:
Zhiming Xu,
Suorong Yang,
Baile Xu,
Furao Shen,
Jian Zhao
Abstract:
Class-incremental learning (CIL) aims to acquire new classes while conserving historical knowledge incrementally. Although existing pre-trained model (PTM) based methods perform excellently in CIL, it is better to fine-tune them on downstream incremental tasks with massive patterns unknown to PTMs. However, using task streams for fine-tuning could lead to catastrophic forgetting that will erase the knowledge in PTMs. This paper proposes the Dual Prototype network for Task-wise Adaption (DPTA) of PTM-based CIL. For each incremental learning task, an adapter module is built to fine-tune the PTM, where the center-adapt loss forces the representation to be more centrally clustered and class separable. The dual prototype network improves the prediction process by enabling test-time adapter selection, where the raw prototypes deduce several possible task indexes of test samples to select suitable adapter modules for PTM, and the augmented prototypes that could separate highly correlated classes are utilized to determine the final result. Experiments on several benchmark datasets demonstrate the excellent performance of DPTA. Code is available at https://github.com/Yorkxzm/DPTA
Submitted 1 July, 2025; v1 submitted 26 November, 2024;
originally announced November 2024.
-
Robust Inference for High-dimensional Linear Models with Heavy-tailed Errors via Partial Gini Covariance
Authors:
Yilin Zhang,
Songshan Yang,
Yunan Wu,
Lan Wang
Abstract:
This paper introduces the partial Gini covariance, a novel dependence measure that addresses the challenges of high-dimensional inference with heavy-tailed errors, often encountered in fields like finance, insurance, climate, and biology. Conventional high-dimensional regression inference methods suffer from inaccurate type I errors and reduced power in heavy-tailed contexts, limiting their effectiveness. Our proposed approach leverages the partial Gini covariance to construct a robust statistical inference framework that requires minimal tuning and does not impose restrictive moment conditions on error distributions. Unlike traditional methods, it circumvents the need for estimating the density of random errors and enhances computational feasibility and robustness. Extensive simulations demonstrate the proposed method's superior power and robustness over standard high-dimensional inference approaches, such as those based on the debiased Lasso. The asymptotic relative efficiency analysis provides additional theoretical insight into the improved efficiency of the new approach in the heavy-tailed setting. Additionally, the partial Gini covariance extends to the multivariate setting, enabling chi-square testing for a group of coefficients. We illustrate the method's practical application with a real-world data example.
Submitted 20 November, 2024; v1 submitted 19 November, 2024;
originally announced November 2024.
-
O-MAGIC: Online Change-Point Detection for Dynamic Systems
Authors:
Yan Sun,
Yeping Wang,
Zhaohui Li,
Shihao Yang
Abstract:
The capture of changes in dynamic systems, especially ordinary differential equations (ODEs), is an important and challenging task, with multiple applications in biomedical research and other scientific areas. This article proposes a fast and mathematically rigorous online method, called ODE-informed MAnifold-constrained Gaussian process Inference for Change point detection (O-MAGIC), to detect changes of parameters in the ODE system using noisy and sparse observation data. O-MAGIC imposes a Gaussian process prior on the time series of system components with a latent manifold constraint, induced by restricting the derivative process to satisfy ODE conditions. To detect the parameter changes from the observation, we propose a procedure based on a two-sample generalized likelihood ratio (GLR) test that can detect multiple change points in the dynamic system automatically. O-MAGIC bypasses conventional numerical integration and achieves substantial savings in computation time. By incorporating the ODE structures through manifold constraints, O-MAGIC enjoys a significant advantage in detection delay, while following principled statistical construction under the Bayesian paradigm, which further enables it to handle systems with missing data or unobserved components. O-MAGIC can also be applied to general nonlinear systems. Simulation studies on three challenging examples (the SEIRD, Lotka-Volterra, and Lorenz models) are provided to illustrate the robustness and efficiency of O-MAGIC, compared with numerical integration and other popular time-series-based change point detection benchmark methods.
Submitted 19 November, 2024;
originally announced November 2024.
-
♠ SPADE ♠ Split Peak Attention DEcomposition
Authors:
Malcolm Wolff,
Kin G. Olivares,
Boris Oreshkin,
Sunny Ruan,
Sitan Yang,
Abhinav Katoch,
Shankar Ramasubramanian,
Youxin Zhang,
Michael W. Mahoney,
Dmitry Efimov,
Vincent Quenneville-Bélair
Abstract:
Demand forecasting faces challenges induced by Peak Events (PEs) corresponding to special periods such as promotions and holidays. Peak events create significant spikes in demand followed by demand ramp-down periods. Neural networks like MQCNN and MQT overreact to demand peaks by carrying over the elevated PE demand into subsequent Post-Peak-Event (PPE) periods, resulting in significantly over-biased forecasts. To tackle this challenge, we introduce a neural forecasting model called Split Peak Attention DEcomposition, SPADE. This model reduces the impact of PEs on subsequent forecasts by splitting forecasting into two separate tasks: one for PEs and the other for the rest. Its architecture then uses masked convolution filters and a specialized Peak Attention module. We show SPADE's performance on a worldwide retail dataset with hundreds of millions of products. Our results reveal an overall PPE improvement of 4.5%, a 30% improvement for most affected forecasts after promotions and holidays, and an improvement in PE accuracy by 3.9%, relative to current production models.
Submitted 21 January, 2025; v1 submitted 6 November, 2024;
originally announced November 2024.
-
Spatial causal inference in the presence of preferential sampling to study the impacts of marine protected areas
Authors:
Dongjae Son,
Brian J. Reich,
Erin M. Schliep,
Shu Yang,
David A. Gill
Abstract:
Marine Protected Areas (MPAs) have been established globally to conserve marine resources. Given their maintenance costs and impact on commercial fishing, it is critical to evaluate their effectiveness to support future conservation. In this paper, we use data collected from the Australian coast to estimate the effect of MPAs on biodiversity. Environmental studies such as these are often observational, and processes of interest exhibit spatial dependence, which presents challenges in estimating the causal effects. Spatial data can also be subject to preferential sampling, where the sampling locations are related to the response variable, further complicating inference and prediction. To address these challenges, we propose a spatial causal inference method that simultaneously accounts for unmeasured spatial confounders in both the sampling process and the treatment allocation. We prove the identifiability of key parameters in the model and the consistency of the posterior distributions of those parameters. We show via simulation studies that the causal effect of interest can be reliably estimated under the proposed model. The proposed method is applied to assess the effect of MPAs on fish biomass. We find evidence of preferential sampling and that properly accounting for this source of bias impacts the estimate of the causal effect.
Submitted 28 October, 2024;
originally announced October 2024.
-
Doubly protected estimation for survival outcomes utilizing external controls for randomized clinical trials
Authors:
Chenyin Gao,
Shu Yang,
Mingyang Shan,
Wenyu Wendy Ye,
Ilya Lipkovich,
Douglas Faries
Abstract:
Censored survival data are common in clinical trials, but small control groups can pose challenges, particularly in rare diseases or where balanced randomization is impractical. Recent approaches leverage external controls from historical studies or real-world data to strengthen treatment evaluation for survival outcomes. However, using external controls directly may introduce biases due to data heterogeneity. We propose a doubly protected estimator for the treatment-specific restricted mean survival time difference that is more efficient than trial-only estimators and mitigates biases from external data. Our method adjusts for covariate shifts via doubly robust estimation and addresses outcome drift using the DR-Learner for selective borrowing. The approach can incorporate machine learning to approximate survival curves and detect outcome drifts without strict parametric assumptions, borrowing only comparable external controls. Extensive simulation studies and a real-data application evaluating the efficacy of Galcanezumab in mitigating migraine headaches have been conducted to illustrate the effectiveness of our proposed framework.
Submitted 14 May, 2025; v1 submitted 23 October, 2024;
originally announced October 2024.
-
Enhancing Statistical Validity and Power in Hybrid Controlled Trials: A Randomization Inference Approach with Conformal Selective Borrowing
Authors:
Ke Zhu,
Shu Yang,
Xiaofei Wang
Abstract:
External controls from historical trials or observational data can augment randomized controlled trials when large-scale randomization is impractical or unethical, such as in drug evaluation for rare diseases. However, non-randomized external controls can introduce biases, and existing Bayesian and frequentist methods may inflate the type I error rate, particularly in small-sample trials where external data borrowing is most critical. To address these challenges, we propose a randomization inference framework that ensures finite-sample exact and model-free type I error rate control, adhering to the "analyze as you randomize" principle to safeguard against hidden biases. Recognizing that biased external controls reduce the power of randomization tests, we leverage conformal inference to develop an individualized test-then-pool procedure that selectively borrows comparable external controls to improve power. Our approach incorporates selection uncertainty into randomization tests, providing valid post-selection inference. Additionally, we propose an adaptive procedure to optimize the selection threshold by minimizing the mean squared error across a class of estimators encompassing both no-borrowing and full-borrowing approaches. The proposed methods are supported by non-asymptotic theoretical analysis, validated through simulations, and applied to a randomized lung cancer trial that integrates external controls from the National Cancer Database.
Submitted 7 May, 2025; v1 submitted 15 October, 2024;
originally announced October 2024.
-
Clustering Alzheimer's Disease Subtypes via Similarity Learning and Graph Diffusion
Authors:
Tianyi Wei,
Shu Yang,
Davoud Ataee Tarzanagh,
Jingxuan Bao,
Jia Xu,
Patryk Orzechowski,
Joost B. Wagenaar,
Qi Long,
Li Shen
Abstract:
Alzheimer's disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Due to the heterogeneous nature of AD, its diagnosis and treatment pose critical challenges. Consequently, there has been growing research interest in recent years in identifying homogeneous AD subtypes that can help address these challenges. In this study, we aim to identify subtypes of AD that represent distinctive clinical features and underlying pathology by utilizing unsupervised clustering with graph diffusion and similarity learning. We adopted SIMLR, a multi-kernel similarity learning framework, and graph diffusion to perform clustering on a group of 829 patients with AD and mild cognitive impairment (MCI, a prodromal stage of AD) based on their cortical thickness measurements extracted from magnetic resonance imaging (MRI) scans. Although the clustering approach we utilized has not been explored for the task of AD subtyping before, it demonstrated significantly better performance than several commonly used clustering methods. Specifically, we showed the power of graph diffusion in reducing the effects of noise in the subtype detection. Our results revealed five subtypes that differed remarkably in their biomarkers, cognitive status, and some other clinical features. To evaluate the resultant subtypes further, a genetic association study was carried out and successfully identified potential genetic underpinnings of different AD subtypes. Our source code is available at: https://github.com/PennShenLab/AD-SIMLR.
Submitted 4 October, 2024;
originally announced October 2024.
-
WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting
Authors:
Jiecheng Lu,
Xu Han,
Yan Sun,
Shihao Yang
Abstract:
We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
Submitted 11 February, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Expected Diverse Utility (EDU): Diverse Bayesian Optimization of Expensive Computer Simulators
Authors:
John Joshua Miller,
Simon Mak,
Benny Sun,
Sai Ranjeet Narayanan,
Suo Yang,
Zongxuan Sun,
Kenneth S. Kim,
Chol-Bum Mike Kweon
Abstract:
The optimization of expensive black-box simulators arises in a myriad of modern scientific and engineering applications. Bayesian optimization provides an appealing solution, by leveraging a fitted surrogate model to guide the selection of subsequent simulator evaluations. In practice, however, the objective is often not to obtain a single good solution, but rather a ``basket'' of good solutions from which users can choose for downstream decision-making. This need arises in our motivating application for real-time control of internal combustion engines for flight propulsion, where a diverse set of control strategies is essential for stable flight control. There has been little work on this front for Bayesian optimization. We thus propose a new Expected Diverse Utility (EDU) method that searches for diverse ``$ε$-optimal'' solutions: locally-optimal solutions within a tolerance level $ε> 0$ from a global optimum. We show that EDU yields a closed-form acquisition function under a Gaussian process surrogate model, which facilitates efficient sequential queries via automatic differentiation. This closed form further reveals a novel exploration-exploitation-diversity trade-off, which incorporates the desired diversity property within the well-known exploration-exploitation trade-off. We demonstrate the improvement of EDU over existing methods in a suite of numerical experiments, then explore the EDU in two applications on rover trajectory optimization and engine control for flight propulsion.
Submitted 2 February, 2025; v1 submitted 1 October, 2024;
originally announced October 2024.
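To make the notion of "ε-optimal" solutions in the EDU abstract above concrete, here is a minimal sketch that enumerates them by brute force on a one-dimensional grid. It is only an illustration of the definition, not the EDU acquisition function: the minimization convention, the double-well `toy_objective`, and the helper name `eps_optimal_minima` are all assumptions made for this example.

```python
import numpy as np

def eps_optimal_minima(x, f, eps):
    """Return grid points that are strict local minima of f and whose
    value lies within eps of the global minimum over the grid."""
    interior = np.arange(1, len(f) - 1)
    # A grid point is a strict local minimum if it is below both neighbours.
    is_local_min = (f[interior] < f[interior - 1]) & (f[interior] < f[interior + 1])
    idx = interior[is_local_min]
    # Keep only the local minima within the tolerance of the best value.
    keep = idx[f[idx] <= f.min() + eps]
    return x[keep]

def toy_objective(x):
    # Hypothetical double-well objective: two local minima near x = -1 and
    # x = +1, with the left well roughly 0.2 deeper than the right one.
    return (x**2 - 1.0) ** 2 + 0.1 * x

grid = np.linspace(-2.0, 2.0, 4001)
values = toy_objective(grid)
loose = eps_optimal_minima(grid, values, eps=0.25)  # tolerance covers both wells
tight = eps_optimal_minima(grid, values, eps=0.10)  # tolerance covers the deeper well only
```

With the looser tolerance both wells qualify as ε-optimal, while the tighter tolerance keeps only the deeper left well. An expensive simulator rules out this kind of exhaustive grid search, which is why the abstract relies on a surrogate model and an acquisition function instead.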
-
Double-Estimation-Friendly Inference for High-Dimensional Measurement Error Models with Non-Sparse Adaptability
Authors:
Shijie Cui,
Xu Guo,
Songshan Yang,
Zhe Zhang
Abstract:
In this paper, we introduce an innovative testing procedure for assessing individual hypotheses in high-dimensional linear regression models with measurement errors. This method remains robust even when either the X-model or Y-model is misspecified. We develop a double robust score function that maintains a zero expectation if one of the models is incorrect, and we construct a corresponding score test. We first show the asymptotic normality of our approach in a low-dimensional setting, and then extend it to the high-dimensional models. Our analysis of high-dimensional settings explores scenarios both with and without the sparsity condition, establishing asymptotic normality and non-trivial power performance under local alternatives. Simulation studies and real data analysis demonstrate the effectiveness of the proposed method.
Submitted 11 January, 2025; v1 submitted 24 September, 2024;
originally announced September 2024.
-
Off-Policy Evaluation with Irregularly-Spaced, Outcome-Dependent Observation Times
Authors:
Xin Chen,
Wenbin Lu,
Shu Yang,
Dipankar Bandyopadhyay
Abstract:
While the classic off-policy evaluation (OPE) literature commonly assumes decision time points to be evenly spaced for simplicity, in many real-world scenarios, such as those involving user-initiated visits, decisions are made at irregularly-spaced and potentially outcome-dependent time points. For a more principled evaluation of the dynamic policies, this paper constructs a novel OPE framework, which concerns not only the state-action process but also an observation process dictating the time points at which decisions are made. The framework is closely connected to the Markov decision process in computer science and with the renewal process in the statistical literature. Within the framework, two distinct value functions, derived from cumulative reward and integrated reward respectively, are considered, and statistical inference for each value function is developed under revised Markov and time-homogeneous assumptions. The validity of the proposed method is further supported by theoretical results, simulation studies, and a real-world application from electronic health records (EHR) evaluating periodontal disease treatments.
Submitted 13 September, 2024;
originally announced September 2024.
-
Improve Sensitivity Analysis Synthesizing Randomized Clinical Trials With Limited Overlap
Authors:
Kuan Jiang,
Wenjie Hu,
Shu Yang,
Xinxing Lai,
Xiaohua Zhou
Abstract:
Randomized clinical trials are the gold standard when estimating the average treatment effect. However, they are usually not a random sample from the real-world population because of the inclusion/exclusion rules. Meanwhile, observational studies typically consist of representative samples from the real-world population. However, due to unmeasured confounding, sensitivity analysis is often used to estimate bounds for the average treatment effect without relying on stringent assumptions of other existing methods. This article introduces a synthesis estimator that improves sensitivity analysis in observational studies by incorporating randomized clinical trial data, even when overlap in covariate distribution is limited due to inclusion/exclusion criteria. We show that the proposed estimator will give a tighter bound when a "separability" condition holds for the sensitivity parameter. Theoretical proofs and simulations show that this method provides a tighter bound than the sensitivity analysis using only observational study. We apply this method to combine an observational study on drug effectiveness with a partially overlapping RCT dataset, yielding improved average treatment effect bounds.
Submitted 10 December, 2024; v1 submitted 11 September, 2024;
originally announced September 2024.
-
Deep Uncertainty-Based Explore for Index Construction and Retrieval in Recommendation System
Authors:
Xin Jiang,
Kaiqiang Wang,
Yinlong Wang,
Fengchang Lv,
Taiyang Peng,
Shuai Yang,
Xianteng Wu,
Pengye Zhang,
Shuo Yuan,
Yifan Zeng
Abstract:
In recommendation systems, the relevance and novelty of the final results are selected through a cascade system of Matching -> Ranking -> Strategy. The matching model serves as the starting point of the pipeline and determines the upper bound of the subsequent stages. Balancing the relevance and novelty of matching results is a crucial step in the design and optimization of recommendation systems, contributing significantly to improving recommendation quality. However, typical matching algorithms have not fully addressed relevance and novelty simultaneously. One main reason is that deep matching algorithms exhibit significant uncertainty when estimating long-tail items (e.g., due to insufficient training samples). The uncertainty not only affects the training of the models but also influences the confidence in the index construction and beam search retrieval process of these models. This paper proposes the UICR (Uncertainty-based explore for Index Construction and Retrieval) algorithm, which introduces the concept of uncertainty modeling in the matching stage and achieves multi-task modeling of model uncertainty and index uncertainty. The final matching results are obtained by combining the relevance score and uncertainty score inferred by the model. Experimental results demonstrate that UICR improves novelty without sacrificing relevance in real-world industrial production environments and on multiple open-source datasets. Remarkably, online A/B test results for display advertising at Shopee demonstrate the effectiveness of the proposed algorithm.
Submitted 5 August, 2024; v1 submitted 21 July, 2024;
originally announced August 2024.
-
Conversational Dueling Bandits in Generalized Linear Models
Authors:
Shuhua Yang,
Hui Yuan,
Xiaoying Zhang,
Mengdi Wang,
Hong Zhang,
Huazheng Wang
Abstract:
Conversational recommendation systems elicit user preferences by interacting with users to obtain their feedback on recommended commodities. Such systems utilize a multi-armed bandit framework to learn user preferences in an online manner and have received great success in recent years. However, existing conversational bandit methods have several limitations. First, they only enable users to provide explicit binary feedback on the recommended items or categories, leading to ambiguity in interpretation. In practice, users are usually faced with more than one choice. Relative feedback, known for its informativeness, has gained increasing popularity in recommendation system design. Moreover, current contextual bandit methods mainly work under linear reward assumptions, ignoring practical non-linear reward structures in generalized linear models. Therefore, in this paper, we introduce relative feedback-based conversations into conversational recommendation systems through the integration of dueling bandits in generalized linear models (GLM) and propose a novel conversational dueling bandit algorithm called ConDuel. Theoretical analyses of regret upper bounds and empirical validations on synthetic and real-world data underscore ConDuel's efficacy. We also demonstrate the potential to extend our algorithm to multinomial logit bandits with theoretical and experimental guarantees, which further proves the applicability of the proposed framework.
Submitted 25 July, 2024;
originally announced July 2024.
-
High-dimensional log contrast models with measurement errors
Authors:
Wenxi Tan,
Lingzhou Xue,
Songshan Yang,
Xiang Zhan
Abstract:
High-dimensional compositional data are frequently encountered in many fields of modern scientific research. In regression analysis of compositional data, the presence of covariate measurement errors poses grand challenges for existing statistical error-in-variable regression analysis methods since measurement error in one component of the composition has an impact on others. To simultaneously address the compositional nature and measurement errors in the high-dimensional design matrix of compositional covariates, we propose a new method named Error-in-composition (Eric) Lasso for regression analysis of corrupted compositional predictors. Estimation error bounds of Eric Lasso and its asymptotic sign-consistent selection properties are established. We then illustrate the finite sample performance of Eric Lasso using simulation studies and demonstrate its potential usefulness in a real data application example.
Submitted 21 July, 2024;
originally announced July 2024.
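For readers unfamiliar with the model named in the title above, the linear log-contrast model is usually written as below in the compositional-data literature. This is the standard textbook form, given here as background rather than quoted from the paper; the symbols $y_i$, $x_{ij}$, $\beta_j$, and $\varepsilon_i$ are notation assumed for this note, and the abstract does not specify the paper's measurement-error model, so none is shown.

```latex
y_i = \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i ,
\qquad \text{subject to } \sum_{j=1}^{p} \beta_j = 0 ,
\qquad x_{ij} > 0 , \quad \sum_{j=1}^{p} x_{ij} = 1 .
```

The zero-sum constraint on the coefficients makes the fit invariant to rescaling of the composition. Because the components sum to one, an error in one component shifts the others, which is the difficulty the abstract highlights.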
-
UQE: A Query Engine for Unstructured Databases
Authors:
Hanjun Dai,
Bethany Yixin Wang,
Xingchen Wan,
Bo Dai,
Sherry Yang,
Azade Nova,
Pengcheng Yin,
Phitchaya Mangpo Phothilimthana,
Charles Sutton,
Dale Schuurmans
Abstract:
Analytics on structured data is a mature field with many successful methods. However, most real world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs and reviews, across a range of useful query types, including conditional aggregation, semantic retrieval and abstraction aggregation.
Submitted 16 November, 2024; v1 submitted 23 June, 2024;
originally announced July 2024.
-
Bayesian Structured Mediation Analysis With Unobserved Confounders
Authors:
Yuliang Xu,
Shu Yang,
Jian Kang
Abstract:
We explore methods to reduce the impact of unobserved confounders on the causal mediation analysis of high-dimensional mediators with spatially smooth structures, such as brain imaging data. The key approach is to incorporate the latent individual effects, which influence the structured mediators, as unobserved confounders in the outcome model, thereby potentially debiasing the mediation effects. We develop BAyesian Structured Mediation analysis with Unobserved confounders (BASMU) framework, and establish its model identifiability conditions. Theoretical analysis is conducted on the asymptotic bias of the Natural Indirect Effect (NIE) and the Natural Direct Effect (NDE) when the unobserved confounders are omitted in mediation analysis. For BASMU, we propose a two-stage estimation algorithm to mitigate the impact of these unobserved confounders on estimating the mediation effect. Extensive simulations demonstrate that BASMU substantially reduces the bias in various scenarios. We apply BASMU to the analysis of fMRI data in the Adolescent Brain Cognitive Development (ABCD) study, focusing on four brain regions previously reported to exhibit meaningful mediation effects. Compared with the existing image mediation analysis method, BASMU identifies two to four times more voxels that have significant mediation effects, with the NIE increased by 41%, and the NDE decreased by 26%.
Submitted 4 July, 2024;
originally announced July 2024.
-
F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data
Authors:
Zexing Xu,
Linjun Zhang,
Sitan Yang,
Rasoul Etesami,
Hanghang Tong,
Huan Zhang,
Jiawei Han
Abstract:
Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stake sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns from similar entities during non-peak periods, enriched by features learned from a graph neural networks (GNNs)-based forecasting model, to predict demand during peak events. We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm that leverages proxy data from non-peak periods and GNN-generated relational metadata to learn feature-specific layer parameters, thereby adapting to demand forecasts for peak events. Theoretically, we show that by considering domain similarities through task-specific metadata, our model achieves improved generalization, where the excess risk decreases as the number of training tasks increases. Empirical evaluations on large-scale industrial datasets demonstrate the superiority of our approach. Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.
Submitted 23 June, 2024;
originally announced June 2024.
-
Semiparametric Localized Principal Stratification Analysis with Continuous Strata
Authors:
Yichi Zhang,
Shu Yang
Abstract:
Principal stratification is essential for revealing causal mechanisms involving post-treatment intermediate variables, in real-world applications like surrogate marker evaluation. Principal stratification analysis with continuous intermediate variables is increasingly common but challenging due to the infinite principal strata and the nonidentifiability and nonregularity of principal causal effects. Inspired by recent research, we resolve these challenges by first using a flexible copula-based principal score model to identify principal causal effect under weak principal ignorability. We then target the local functional substitute of principal causal effect, which is statistically regular and can accurately approximate principal causal effect with vanishing bandwidth. We simplify the full efficient influence function of the local functional substitute by considering its oracle-scenario alternative. This leads to a computationally efficient and straightforward estimator for the local functional substitute and principal causal effect with vanishing bandwidth. We prove the double robustness of our proposed estimator, and derive its asymptotic normality for inferential purposes. With a vanishing bandwidth, our method attains minimax optimality for the nonparametric estimation of the principal causal effect. With a fixed bandwidth, it achieves semiparametric efficiency in estimating its local functional substitute. We demonstrate the strong performance of our proposed estimator through simulations and apply it to surrogate analysis of short-term CD4 count in ACTG 175.
Submitted 29 January, 2025; v1 submitted 19 June, 2024;
originally announced June 2024.
-
A Practical Analysis Procedure on Generalizing Comparative Effectiveness in the Randomized Clinical Trial to the Real-world Trial-eligible Population
Authors:
Kuan Jiang,
Xin-xing Lai,
Shu Yang,
Ying Gao,
Xiao-Hua Zhou
Abstract:
When evaluating the effectiveness of a drug, a Randomized Controlled Trial (RCT) is often considered the gold standard due to its perfect randomization. While an RCT assures strong internal validity, its restricted external validity poses challenges in extending treatment effects to the broader real-world population due to possible heterogeneity in covariates. In this paper, we introduce a procedure to generalize the RCT findings to the real-world trial-eligible population based on the adaptation of existing statistical methods. We utilized the augmented inverse probability of sampling weighting (AIPSW) estimator for the estimation and the omitted variable bias framework to assess the robustness of the estimate against the assumption violation caused by potentially unmeasured confounders. As an illustration, we analyzed an RCT comparing the effectiveness of Songling Xuemaikang Capsule (SXC), a traditional Chinese medicine (TCM), and Losartan in lowering blood pressure in patients with hypertension. The generalization results indicated that although SXC is less effective in lowering blood pressure than Losartan at weeks 2, 4, and 6, there is no statistically significant difference among the trial-eligible population at week 8, and the generalization is robust against potential unmeasured confounders.
Submitted 6 June, 2024;
originally announced June 2024.
-
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF
Authors:
Shicong Cen,
Jincheng Mei,
Katayoon Goshvadi,
Hanjun Dai,
Tong Yang,
Sherry Yang,
Dale Schuurmans,
Yuejie Chi,
Bo Dai
Abstract:
Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas of investigation. A key bottleneck is understanding how to incorporate uncertainty estimation in the reward function learned from the preference data for RLHF, regardless of how the preference data is collected. While the principles of optimism or pessimism under uncertainty are well-established in standard reinforcement learning (RL), a practically-implementable and theoretically-grounded form amenable to large language models is not yet available, as standard techniques for constructing confidence intervals become intractable under arbitrary policy parameterizations.
In this paper, we introduce a unified approach to online and offline RLHF -- value-incentivized preference optimization (VPO) -- which regularizes the maximum-likelihood estimate of the reward function with the corresponding value function, modulated by a $\textit{sign}$ to indicate whether the optimism or pessimism is chosen. VPO also directly optimizes the policy with implicit reward modeling, and therefore shares a simpler RLHF pipeline similar to direct preference optimization. Theoretical guarantees of VPO are provided for both online and offline settings, matching the rates of their standard RL counterparts. Moreover, experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
Submitted 18 February, 2025; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Matrix Manifold Neural Networks++
Authors:
Xuan Son Nguyen,
Shuo Yang,
Aymeric Histace
Abstract:
Deep neural networks (DNNs) on Riemannian manifolds have garnered increasing interest in various applied areas. For instance, DNNs on spherical and hyperbolic manifolds have been designed to solve a wide range of computer vision and natural language processing tasks. One of the key factors that contribute to the success of these networks is that spherical and hyperbolic manifolds have the rich algebraic structures of gyrogroups and gyrovector spaces. This enables principled and effective generalizations of the most successful DNNs to these manifolds. Recently, some works have shown that many concepts in the theory of gyrogroups and gyrovector spaces can also be generalized to matrix manifolds such as Symmetric Positive Definite (SPD) and Grassmann manifolds. As a result, some building blocks for SPD and Grassmann neural networks, e.g., isometric models and multinomial logistic regression (MLR), can be derived in a way that is fully analogous to their spherical and hyperbolic counterparts. Building upon these works, we design fully-connected (FC) and convolutional layers for SPD neural networks. We also develop MLR on Symmetric Positive Semi-definite (SPSD) manifolds, and propose a method for performing backpropagation with the Grassmann logarithmic map in the projector perspective. We demonstrate the effectiveness of the proposed approach in the human action recognition and node classification tasks.
Submitted 29 May, 2024;
originally announced May 2024.
-
Inference for Optimal Linear Treatment Regimes in Personalized Decision-making
Authors:
Yuwen Cheng,
Shu Yang
Abstract:
Personalized decision-making, tailored to individual characteristics, is gaining significant attention. The optimal treatment regime aims to provide the best expected outcome in the entire population, known as the value function. One approach to determine this optimal regime is by maximizing the Augmented Inverse Probability Weighting (AIPW) estimator of the value function. However, the derived treatment regime can be intricate and nonlinear, limiting its use. For clarity and interpretability, we emphasize linear regimes and determine the optimal linear regime by optimizing the AIPW estimator within set constraints.
While the AIPW estimator offers a viable path to estimating the optimal regime, current methodologies predominantly focus on its asymptotic distribution, leaving a gap in studying the linear regime itself. However, there are many benefits to understanding the regime, as pinpointing significant covariates can enhance treatment effects and provide future clinical guidance. In this paper, we explore the asymptotic distribution of the estimated linear regime. Our results show that the parameter associated with the linear regime follows a cube-root convergence to a non-normal limiting distribution characterized by the maximizer of a centered Gaussian process with a quadratic drift. When making inferences for the estimated linear regimes with cube-root convergence in practical scenarios, the standard nonparametric bootstrap is invalid. As a solution, we adopt the bootstrap technique of Cattaneo et al. (2020) to provide a consistent distributional approximation for the estimated linear regimes, validated further through simulations and real-world data applications from the eICU Collaborative Research Database.
Submitted 25 May, 2024;
originally announced May 2024.
-
In-context Time Series Predictor
Authors:
Jiecheng Lu,
Yan Sun,
Shihao Yang
Abstract:
Recent Transformer-based large language models (LLMs) demonstrate an in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize these in-context capabilities in time series forecasting (TSF) problems, and unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate "time series forecasting tasks" as input tokens by constructing a series of (lookback, future) pairs within the tokens. This method aligns more closely with the inherent in-context mechanisms and is more parameter-efficient, since it does not require pre-trained LLM parameters. Furthermore, it addresses issues such as overfitting in existing Transformer-based TSF models, consistently achieving better performance across full-data, few-shot, and zero-shot settings compared to previous architectures.
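The (lookback, future) construction can be sketched as sliding windows over a univariate series. The layout below, in which each token concatenates a lookback window with the values that follow it and the final query token carries a zero-filled future slot, is an assumption made for illustration; the names `build_context_tokens` and `build_query_token` are hypothetical, and the paper's actual tokenization may differ.

```python
import numpy as np

def build_context_tokens(series, lookback, horizon, stride=1):
    """Stack (lookback, future) pairs, one pair per row."""
    series = np.asarray(series, dtype=float)
    width = lookback + horizon
    if len(series) < width:
        raise ValueError("series is shorter than lookback + horizon")
    starts = range(0, len(series) - width + 1, stride)
    return np.stack([series[s:s + width] for s in starts])

def build_query_token(series, lookback, horizon):
    """Latest lookback window with a zero placeholder where the future goes."""
    series = np.asarray(series, dtype=float)
    return np.concatenate([series[-lookback:], np.zeros(horizon)])
```

Under this layout, a model would read the context rows as solved examples and fill in the placeholder part of the query row.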
Submitted 23 May, 2024;
originally announced May 2024.
-
Causal Customer Churn Analysis with Low-rank Tensor Block Hazard Model
Authors:
Chenyin Gao,
Zhiming Zhang,
Shu Yang
Abstract:
This study introduces an innovative method for analyzing the impact of various interventions on customer churn, using the potential outcomes framework. We present a new causal model, the tensorized latent factor block hazard model, which incorporates tensor completion methods for a principled causal analysis of customer churn. A crucial element of our approach is the formulation of a 1-bit tensor completion for the parameter tensor. This captures hidden customer characteristics and temporal elements from churn records, effectively addressing the binary nature of churn data and its time-monotonic trends. Our model also uniquely categorizes interventions by their similar impacts, enhancing the precision and practicality of implementing customer retention strategies. For computational efficiency, we apply a projected gradient descent algorithm combined with spectral clustering. We lay down the theoretical groundwork for our model, including its non-asymptotic properties. The efficacy and superiority of our model are further validated through comprehensive experiments on both simulated and real-world applications.
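For intuition about 1-bit completion with projected gradient descent, the sketch below fits a low-rank logit array to a fully observed binary tensor. It is a deliberate simplification and not the paper's block hazard model: the tensor is flattened along its first mode, the low-rank projection is a truncated SVD of that unfolding, and the spectral clustering step is omitted. All function names and the default step size are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_lik(theta, y):
    # Bernoulli negative log-likelihood with a logit link, summed over entries.
    return np.sum(np.logaddexp(0.0, theta) - y * theta)

def project_rank(mat, rank):
    # Best rank-`rank` approximation in Frobenius norm via truncated SVD.
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def fit_one_bit(y_tensor, rank, step=1.0, n_iter=50):
    """Projected gradient descent on the mode-1 unfolding of a binary tensor."""
    shape = y_tensor.shape
    y = y_tensor.reshape(shape[0], -1).astype(float)
    theta = np.zeros_like(y)
    for _ in range(n_iter):
        grad = sigmoid(theta) - y                  # gradient of the summed NLL
        theta = project_rank(theta - step * grad, rank)
    return theta.reshape(shape)
```

The logistic loss has a gradient that is Lipschitz with constant 1/4, so a unit step followed by the rank projection does not increase the objective in this simplified setting.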
Submitted 18 May, 2024;
originally announced May 2024.