-
Statistical methods for cost-effectiveness analysis of left-truncated censored survival data with treatment delays
Authors:
Polyna Khudyakov,
Li Xu,
Ce Yang,
Donna Spiegelman,
Molin Wang
Abstract:
The incremental cost-effectiveness ratio (ICER) and incremental net benefit (INB) are widely used for cost-effectiveness analysis. We develop methods for estimation and inference for the ICER and INB which use the semiparametric stratified Cox proportional hazards model, allowing for adjustment for risk factors. Because patients in public health settings often begin treatment after they become eligible, we account for delay times in treatment initiation. Excellent finite sample properties of the proposed estimator are demonstrated in an extensive simulation study under different delay scenarios. We apply the proposed method to evaluate the cost-effectiveness of switching treatments among AIDS patients in Tanzania.
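For orientation, the two summary measures named in this abstract are conventionally defined as below. The notation is assumed here for illustration and is not taken from the paper: $C_1, C_0$ denote mean costs and $E_1, E_0$ mean effectiveness (for example, expected survival time) under the new and comparator treatments, and $\lambda$ is a willingness-to-pay threshold per unit of effectiveness.

```latex
\[
\mathrm{ICER} = \frac{C_1 - C_0}{E_1 - E_0},
\qquad
\mathrm{INB}(\lambda) = \lambda\,(E_1 - E_0) - (C_1 - C_0).
\]
```

Under this convention, the new treatment is regarded as cost-effective at threshold $\lambda$ when $\mathrm{INB}(\lambda) > 0$, which for $E_1 > E_0$ is equivalent to $\mathrm{ICER} < \lambda$.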
Submitted 9 May, 2025;
originally announced May 2025.
-
Statistical Inference for High-Dimensional Robust Linear Regression Models via Recursive Online-Score Estimation
Authors:
Dian Zheng,
Lingzhou Xue
Abstract:
This paper introduces a novel framework for estimation and inference in penalized M-estimators applied to robust high-dimensional linear regression models. Traditional methods for high-dimensional statistical inference, which predominantly rely on convex likelihood-based approaches, struggle to address the nonconvexity inherent in penalized M-estimation with nonconvex objective functions. Our proposed method extends the recursive online score estimation (ROSE) framework of Shi et al. (2021) to robust high-dimensional settings by developing a recursive score equation based on penalized M-estimation, explicitly addressing nonconvexity. We establish the statistical consistency and asymptotic normality of the resulting estimator, providing a rigorous foundation for valid inference in robust high-dimensional regression. The effectiveness of our method is demonstrated through simulation studies and a real-world application, showcasing its superior performance compared to existing approaches.
Submitted 12 April, 2025;
originally announced April 2025.
-
Adaptive Design for Contour Estimation from Computer Experiments with Quantitative and Qualitative Inputs
Authors:
A. Shahrokhian,
X. Deng,
C. D. Lin,
P. Ranjan,
L. Xu
Abstract:
Computer experiments with quantitative and qualitative inputs are widely used to study many scientific and engineering processes. Much of the existing work has focused on design and modeling or process optimization for such experiments. This paper proposes an adaptive design approach for estimating a contour from computer experiments with quantitative and qualitative inputs. A new criterion is introduced to search for the follow-up inputs. The key features of the proposed criterion are (a) the criterion yields adaptive search regions; and (b) it is region-based cooperative in that for each stage of the sequential procedure, the candidate points in the design space are divided into two disjoint groups using confidence bounds, and within each group, an acquisition function is used to select a candidate point. Of the two selected points, a point that is closer to the contour level with the higher uncertainty or that has higher uncertainty when the distance between its prediction and the contour level is within a threshold is chosen. The proposed approach provides empirically more accurate contour estimation than existing approaches as illustrated in numerical examples and a real application. Theoretical justification of the proposed adaptive search region is given.
Submitted 29 April, 2025; v1 submitted 7 April, 2025;
originally announced April 2025.
-
Statistical Inference for High-dimensional Matrix-variate Factor Models with Missing Observations
Authors:
Yongxia Zhang,
Jinwen Liang,
Liwen Xu,
Keming Yu,
Maozai Tian
Abstract:
This paper develops an inferential theory for high-dimensional matrix-variate factor models with missing observations. We propose an easy-to-use all-purpose method that involves two straightforward steps. First, we perform principal component analysis on two re-weighted covariance matrices to obtain the row and column loadings. Second, we utilize these loadings along with the matrix-variate data to derive the factors. We develop an inferential theory that establishes the consistency and the rate of convergence under general conditions and missing patterns. The simulation results demonstrate the adequacy of the asymptotic results in approximating the properties of a finite sample. Finally, we illustrate the application of our method using a real numerical dataset.
Submitted 24 March, 2025;
originally announced March 2025.
-
Manifold learning in metric spaces
Authors:
Liane Xu,
Amit Singer
Abstract:
Laplacian-based methods are popular for dimensionality reduction of data lying in $\mathbb{R}^N$. Several theoretical results for these algorithms depend on the fact that the Euclidean distance approximates the geodesic distance on the underlying submanifold which the data are assumed to lie on. However, for some applications, other metrics, such as the Wasserstein distance, may provide a more appropriate notion of distance than the Euclidean distance. We provide a framework that generalizes the problem of manifold learning to metric spaces and study when a metric satisfies sufficient conditions for the pointwise convergence of the graph Laplacian.
Submitted 20 March, 2025;
originally announced March 2025.
-
Gap-Dependent Bounds for Federated $Q$-learning
Authors:
Haochen Zhang,
Zhong Zheng,
Lingzhou Xue
Abstract:
We present the first gap-dependent analysis of regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing federated reinforcement learning (FRL) methods focus on worst-case scenarios, leading to $\sqrt{T}$-type regret bounds and communication cost bounds with a $\log T$ term scaling with the number of agents $M$, states $S$, and actions $A$, where $T$ is the average total number of steps per agent. In contrast, our novel framework leverages the benign structures of MDPs, such as a strictly positive suboptimality gap, to achieve a $\log T$-type regret bound and a refined communication cost bound that disentangles exploration and exploitation. Our gap-dependent regret bound reveals a distinct multi-agent speedup pattern, and our gap-dependent communication cost bound removes the dependence on $MSA$ from the $\log T$ term. Notably, our gap-dependent communication cost bound also yields a better global switching cost when $M=1$, removing $SA$ from the $\log T$ term.
Submitted 4 February, 2025;
originally announced February 2025.
-
Clustering of functional data prone to complex heteroscedastic measurement error
Authors:
Andi Mai,
Lan Xue,
Roger Zoh,
Carmen Tekwe
Abstract:
Several factors make clustering of functional data challenging, including the infinite-dimensional space to which observations belong and the lack of a defined probability density function for the functional random variable. To overcome these barriers, researchers either assume that observations belong to a finite-dimensional space spanned by basis functions or apply nonparametric smoothing methods to the functions prior to clustering. Although extensive literature describes clustering methods for functional data, few studies have explored the clustering of measurement error–prone function-valued data. In this work, we consider clustering methods for functional data prone to complex, heteroscedastic measurement errors. Two stage-based methods using mixed-effects models are first applied to adjust for measurement error bias, followed by cluster analysis of the measurement error–adjusted curves. Through simulations, we investigate how varying sample size, the magnitude of measurement error, and the presence of complex heteroscedastic measurement errors influence the cluster analysis of functional data. Our results indicate that failing to account for measurement errors and the correlation structures associated with frequently collected functional data reduces the accuracy of identifying the true latent groups or clusters. The method consistently produces better results regardless of the initial clustering values used. Moreover, it is flexible and can be applied to various clustering approaches, based on the specific distribution of the data. The developed methods are applied to two data sets: a school-based study of energy expenditure among elementary school-aged children in Texas and data from the National Health and Nutrition Examination Survey on participants' physical activity monitored by wearable devices at frequent intervals.
Submitted 31 January, 2025; v1 submitted 24 January, 2025;
originally announced January 2025.
-
Avoiding subtraction and division of stochastic signals using normalizing flows: NFdeconvolve
Authors:
Pedro Pessoa,
Max Schweiger,
Lance W. Q. Xu,
Tristan Manha,
Ayush Saurabh,
Julian Antolin Camarena,
Steve Pressé
Abstract:
Across the scientific realm, we find ourselves subtracting or dividing stochastic signals. For instance, consider a stochastic realization, $x$, generated from the addition or multiplication of two stochastic signals $a$ and $b$, namely $x=a+b$ or $x = ab$. For the $x=a+b$ example, $a$ can be fluorescence background and $b$ the signal of interest whose statistics are to be learned from the measured $x$. Similarly, when writing $x=ab$, $a$ can be thought of as the illumination intensity and $b$ the density of fluorescent molecules of interest. Yet dividing or subtracting stochastic signals amplifies noise, and we ask instead whether, using the statistics of $a$ and the measurement of $x$ as input, we can recover the statistics of $b$. Here, we show how normalizing flows can generate an approximation of the probability distribution over $b$, thereby avoiding subtraction or division altogether. This method is implemented in our software package, NFdeconvolve, available on GitHub with a tutorial linked in the main text.
Submitted 14 January, 2025;
originally announced January 2025.
-
Statistical Convergence Rates of Optimal Transport Map Estimation between General Distributions
Authors:
Yizhe Ding,
Runze Li,
Lingzhou Xue
Abstract:
This paper studies the convergence rates of optimal transport (OT) map estimators, a topic of growing interest in statistics, machine learning, and various scientific fields. Despite recent advancements, existing results rely on regularity assumptions that are very restrictive in practice and much stricter than those in Brenier's Theorem, including the compactness and convexity of the probability support and the bi-Lipschitz property of the OT maps. We aim to broaden the scope of OT map estimation and fill this gap between theory and practice. Given the strong convexity assumption on Brenier's potential, we first establish the non-asymptotic convergence rates for the original plug-in estimator without requiring restrictive assumptions on probability measures. Additionally, we introduce a sieve plug-in estimator and establish its convergence rates without the strong convexity assumption on Brenier's potential, enabling the widely used cases such as the rank functions of normal or t-distributions. We also establish new Poincaré-type inequalities, which are proved given sufficient conditions on the local boundedness of the probability density and mild topological conditions of the support, and these new inequalities enable us to achieve faster convergence rates for the Donsker function class. Moreover, we develop scalable algorithms to efficiently solve the OT map estimation using neural networks and present numerical experiments to demonstrate the effectiveness and robustness.
Submitted 10 December, 2024;
originally announced December 2024.
-
Hypothesis Testing for High-Dimensional Matrix-Valued Data
Authors:
Shijie Cui,
Danning Li,
Runze Li,
Lingzhou Xue
Abstract:
This paper addresses hypothesis testing for the mean of matrix-valued data in high-dimensional settings. We investigate the minimum discrepancy test, originally proposed by Cragg (1997), which serves as a rank test for lower-dimensional matrices. We evaluate the performance of this test as the matrix dimensions increase proportionally with the sample size, and identify its limitations when matrix dimensions significantly exceed the sample size. To address these challenges, we propose a new test statistic tailored for high-dimensional matrix rank testing. The oracle version of this statistic is analyzed to highlight its theoretical properties. Additionally, we develop a novel approach for constructing a sparse singular value decomposition (SVD) estimator for singular vectors, providing a comprehensive examination of its theoretical aspects. Using the sparse SVD estimator, we explore the properties of the sample version of our proposed statistic. The paper concludes with simulation studies and two case studies involving surveillance video data, demonstrating the practical utility of our proposed methods.
Submitted 10 December, 2024;
originally announced December 2024.
-
Error estimates between SGD with momentum and underdamped Langevin diffusion
Authors:
Arnaud Guillin,
Yu Wang,
Lihu Xu,
Haoran Yang
Abstract:
Stochastic gradient descent with momentum is a popular variant of stochastic gradient descent, which has recently been reported to have a close relationship with the underdamped Langevin diffusion. In this paper, we establish a quantitative error estimate between them in the 1-Wasserstein and total variation distances.
Submitted 22 October, 2024;
originally announced October 2024.
-
The Effect of Personalization in FedProx: A Fine-grained Analysis on Statistical Accuracy and Communication Efficiency
Authors:
Xin Yu,
Zelin He,
Ying Sun,
Lingzhou Xue,
Runze Li
Abstract:
FedProx is a simple yet effective federated learning method that enables model personalization via regularization. Despite remarkable success in practice, a rigorous analysis of how such a regularization provably improves the statistical accuracy of each client's local model hasn't been fully established. Setting the regularization strength heuristically presents a risk, as an inappropriate choice may even degrade accuracy. This work fills in the gap by analyzing the effect of regularization on statistical accuracy, thereby providing a theoretical guideline for setting the regularization strength for achieving personalization. We prove that by adaptively choosing the regularization strength under different statistical heterogeneity, FedProx can consistently outperform pure local training and achieve a minimax-optimal statistical rate. In addition, to shed light on resource allocation, we design an algorithm, provably showing that stronger personalization reduces communication complexity without increasing the computation cost overhead. Finally, our theory is validated on both synthetic and real-world datasets and its generalizability is verified in a non-convex setting.
Submitted 4 December, 2024; v1 submitted 11 October, 2024;
originally announced October 2024.
-
Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition
Authors:
Zhong Zheng,
Haochen Zhang,
Lingzhou Xue
Abstract:
We study the gap-dependent bounds of two important algorithms for on-policy Q-learning for finite-horizon episodic tabular Markov Decision Processes (MDPs): UCB-Advantage (Zhang et al. 2020) and Q-EarlySettled-Advantage (Li et al. 2021). UCB-Advantage and Q-EarlySettled-Advantage improve upon the results based on Hoeffding-type bonuses and achieve the almost optimal $\sqrt{T}$-type regret bound in the worst-case scenario, where $T$ is the total number of steps. However, the benign structures of the MDPs such as a strictly positive suboptimality gap can significantly improve the regret. While gap-dependent regret bounds have been obtained for Q-learning with Hoeffding-type bonuses, it remains an open question to establish gap-dependent regret bounds for Q-learning using variance estimators in their bonuses and reference-advantage decomposition for variance reduction. We develop a novel error decomposition framework to prove gap-dependent regret bounds of UCB-Advantage and Q-EarlySettled-Advantage that are logarithmic in $T$ and improve upon existing ones for Q-learning algorithms. Moreover, we establish the gap-dependent bound for the policy switching cost of UCB-Advantage and improve that under the worst-case MDPs. To our knowledge, this paper presents the first gap-dependent regret analysis for Q-learning using variance estimators and reference-advantage decomposition and also provides the first gap-dependent analysis on policy switching cost for Q-learning.
Submitted 9 March, 2025; v1 submitted 9 October, 2024;
originally announced October 2024.
-
Smoothed Robust Phase Retrieval
Authors:
Zhong Zheng,
Lingzhou Xue
Abstract:
The phase retrieval problem in the presence of noise aims to recover the signal vector of interest from a set of quadratic measurements with infrequent but arbitrary corruptions, and it plays an important role in many scientific applications. However, the essential geometric structure of the nonconvex robust phase retrieval based on the $\ell_1$-loss, which is needed to study spurious local solutions, remains largely unknown even under the ideal noiseless setting, and its intrinsic nonsmooth nature also impairs the efficiency of optimization algorithms. This paper introduces the smoothed robust phase retrieval (SRPR) based on a family of convolution-type smoothed loss functions. Theoretically, we prove that the SRPR enjoys a benign geometric structure with high probability: (1) under the noiseless situation, the SRPR has no spurious local solutions, and the target signals are global solutions, and (2) under the infrequent but arbitrary corruptions, we characterize the stationary points of the SRPR and prove its benign landscape, which is the first landscape analysis of phase retrieval with corruption in the literature. Moreover, we prove the local linear convergence rate of gradient descent for solving the SRPR under the noiseless situation. Experiments on both simulated datasets and image recovery are provided to demonstrate the numerical performance of the SRPR.
Submitted 2 September, 2024;
originally announced September 2024.
-
High-dimensional log contrast models with measurement errors
Authors:
Wenxi Tan,
Lingzhou Xue,
Songshan Yang,
Xiang Zhan
Abstract:
High-dimensional compositional data are frequently encountered in many fields of modern scientific research. In regression analysis of compositional data, the presence of covariate measurement errors poses grand challenges for existing statistical error-in-variable regression analysis methods since measurement error in one component of the composition has an impact on others. To simultaneously address the compositional nature and measurement errors in the high-dimensional design matrix of compositional covariates, we propose a new method named Error-in-composition (Eric) Lasso for regression analysis of corrupted compositional predictors. Estimation error bounds of Eric Lasso and its asymptotic sign-consistent selection properties are established. We then illustrate the finite sample performance of Eric Lasso using simulation studies and demonstrate its potential usefulness in a real data application example.
Submitted 21 July, 2024;
originally announced July 2024.
-
Clustering functional data with measurement errors: a simulation-based approach
Authors:
Tingyu Zhu,
Lan Xue,
Carmen Tekwe,
Keith Diaz,
Mark Benden,
Roger Zoh
Abstract:
Clustering analysis of functional data, which comprises observations that evolve continuously over time or space, has gained increasing attention across various scientific disciplines. Practical applications often involve functional data that are contaminated with measurement errors arising from imprecise instruments, sampling errors, or other sources. These errors can significantly distort the inherent data structure, resulting in erroneous clustering outcomes. In this paper, we propose a simulation-based approach designed to mitigate the impact of measurement errors. Our proposed method estimates the distribution of functional measurement errors through repeated measurements. Subsequently, the clustering algorithm is applied to simulated data generated from the conditional distribution of the unobserved true functional data given the observed contaminated functional data, accounting for the adjustments made to rectify measurement errors. We show through simulations that the proposed method has better numerical performance than the naive methods that neglect such errors. Our proposed method was applied to a childhood obesity study, giving more reliable clustering results.
Submitted 17 June, 2024;
originally announced June 2024.
-
When Swarm Learning meets energy series data: A decentralized collaborative learning design based on blockchain
Authors:
Lei Xu,
Yulong Chen,
Yuntian Chen,
Longfeng Nie,
Xuetao Wei,
Liang Xue,
Dongxiao Zhang
Abstract:
Machine learning models offer the capability to forecast future energy production or consumption and infer essential unknown variables from existing data. However, legal and policy constraints within specific energy sectors render the data sensitive, presenting technical hurdles in utilizing data from diverse sources. Therefore, we propose adopting a Swarm Learning (SL) scheme, which replaces the centralized server with a blockchain-based distributed network to address the security and privacy issues inherent in Federated Learning (FL)'s centralized architecture. Within this distributed Collaborative Learning framework, each participating organization governs nodes for inter-organizational communication. Devices from various organizations utilize smart contracts for parameter uploading and retrieval. A consensus mechanism ensures distributed consistency throughout the learning process and guarantees the transparent trustworthiness and immutability of parameters on-chain. The efficacy of the proposed framework is substantiated across three real-world energy series modeling scenarios with superior performance compared to Local Learning approaches, simultaneously emphasizing enhanced data security and privacy over Centralized Learning and FL methods. Notably, as the data volume and the number of local epochs increase within a threshold, there is an improvement in model performance accompanied by a reduction in the variance of performance errors. Consequently, this leads to increased stability and reliability in the outcomes produced by the model.
Submitted 7 June, 2024;
originally announced June 2024.
-
Federated Q-Learning with Reference-Advantage Decomposition: Almost Optimal Regret and Logarithmic Communication Cost
Authors:
Zhong Zheng,
Haochen Zhang,
Lingzhou Xue
Abstract:
In this paper, we consider model-free federated reinforcement learning for tabular episodic Markov decision processes. Under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. Despite recent advances in federated Q-learning algorithms achieving near-linear regret speedup with low communication cost, existing algorithms only attain suboptimal regrets compared to the information bound. We propose a novel model-free federated Q-learning algorithm, termed FedQ-Advantage. Our algorithm leverages reference-advantage decomposition for variance reduction and operates under two distinct mechanisms: synchronization between the agents and the server, and policy update, both triggered by events. We prove that our algorithm not only requires a lower logarithmic communication cost but also achieves an almost optimal regret, reaching the information bound up to a logarithmic factor and near-linear regret speedup compared to its single-agent counterpart when the time horizon is sufficiently large.
Submitted 9 March, 2025; v1 submitted 29 May, 2024;
originally announced May 2024.
-
Towards Efficient Disaster Response via Cost-effective Unbiased Class Rate Estimation through Neyman Allocation Stratified Sampling Active Learning
Authors:
Yanbing Bai,
Xinyi Wu,
Lai Xu,
Jihan Pei,
Erick Mas,
Shunichi Koshimura
Abstract:
With the rapid development of earth observation technology, we have entered an era of massively available satellite remote-sensing data. However, a large amount of satellite remote-sensing data lacks labels, or the labeling cost is so high that it hinders the potential of AI technology for mining satellite data. This is especially true in emergency response scenarios that use satellite data to evaluate the degree of disaster damage. Disaster damage assessment has encountered bottlenecks due to excessive focus on the damage of a certain building in a specific geographical space or a certain area on a larger scale. In fact, in the early days of disaster emergency response, government departments are more concerned about the overall damage rate of the disaster area than about single-building damage, because this helps the government decide the level of emergency response. We present an innovative algorithm that constructs Neyman stratified random sampling trees for binary classification and extends this approach to multiclass problems. Through extensive experimentation on various datasets and model structures, our findings demonstrate that our method surpasses both passive and conventional active learning techniques in terms of class rate estimation and model enhancement with only 30%-60% of the annotation cost of simple sampling. It effectively addresses the 'sampling bias' challenge in traditional active learning strategies and mitigates the 'cold start' dilemma. The efficacy of our approach is further substantiated through application to disaster evaluation tasks using Xview2 Satellite imagery, showcasing its practical utility in real-world contexts.
Submitted 27 May, 2024;
originally announced May 2024.
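As a rough illustration of the Neyman allocation idea named in the abstract above, the sketch below splits a labeling budget across strata in proportion to stratum size times within-stratum standard deviation. This is the textbook allocation rule only, not the paper's sampling-tree algorithm; the function name `neyman_allocation` and its parameters are hypothetical, and the per-stratum standard deviations are assumed to be known or estimated beforehand.

```python
def neyman_allocation(total_n, stratum_sizes, stratum_sds):
    """Split a sample budget across strata by the textbook Neyman rule.

    Stratum h receives a share proportional to N_h * S_h, where N_h is the
    stratum size and S_h its (assumed known) standard deviation.
    Returns fractional allocations; rounding is left to the caller.
    """
    if len(stratum_sizes) != len(stratum_sds):
        raise ValueError("stratum_sizes and stratum_sds must have equal length")
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one stratum must have positive size and sd")
    return [total_n * w / total_weight for w in weights]


# Two equally sized strata; the second is three times as variable,
# so it receives three times as many labels.
allocation = neyman_allocation(100, [500, 500], [1.0, 3.0])
```

For a binary class rate, the within-stratum standard deviation would be sqrt(p(1 - p)) for a stratum-level rate p, so strata whose predicted rate is near 0.5 attract more of the budget.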
-
Power-Enhanced Two-Sample Mean Tests for High-Dimensional Compositional Data with Application to Microbiome Data Analysis
Authors:
Danning Li,
Lingzhou Xue,
Haoyi Yang,
Xiufan Yu
Abstract:
Testing differences in mean vectors is a fundamental task in the analysis of high-dimensional compositional data. Existing methods may suffer from low power if the underlying signal pattern is in a situation that does not favor the deployed test. In this work, we develop two-sample power-enhanced mean tests for high-dimensional compositional data based on the combination of $p$-values, which integrates strengths from two popular types of tests: the maximum-type test and the quadratic-type test. We provide rigorous theoretical guarantees on the proposed tests, showing accurate Type-I error rate control and enhanced testing power. Our method boosts the testing power towards a broader alternative space, which yields robust performance across a wide range of signal pattern settings. Our theory also contributes to the literature on power enhancement and Gaussian approximation for high-dimensional hypothesis testing. We demonstrate the performance of our method on both simulated data and real-world microbiome data, showing that our proposed approach improves the testing power substantially compared to existing methods.
Submitted 7 March, 2025; v1 submitted 3 May, 2024;
originally announced May 2024.
-
Adjusting for bias due to measurement error in functional quantile regression models with error-prone functional and scalar covariates
Authors:
Xiwei Chen,
Yuanyuan Luan,
Roger S. Zoh,
Lan Xue,
Sneha Jadhav,
Carmen D. Tekwe
Abstract:
Wearable devices enable the continuous monitoring of physical activity (PA) but generate complex functional data with poorly characterized errors. Most work on functional data views the data as smooth, latent curves obtained at discrete time intervals with some random noise with mean zero and constant variance. Viewing this noise as homoscedastic and independent ignores potential serial correlations. Our preliminary studies indicate that failing to account for these serial correlations can bias estimations. In dietary assessments, epidemiologists often use self-reported measures based on food frequency questionnaires that are prone to recall bias. With the increased availability of complex, high-dimensional functional, and scalar biomedical data potentially prone to measurement errors, it is necessary to adjust for biases induced by these errors to permit accurate analyses in various regression settings. However, there has been limited work to address measurement errors in functional and scalar covariates in the context of quantile regression. Therefore, we developed new statistical methods based on simulation extrapolation (SIMEX) and mixed effects regression with repeated measures to correct for measurement error biases in this context. We conducted simulation studies to establish the finite sample properties of our new methods. The methods are illustrated through application to a real data set.
Submitted 15 April, 2024;
originally announced April 2024.
-
A Unified Combination Framework for Dependent Tests with Applications to Microbiome Association Studies
Authors:
Xiufan Yu,
Linjun Zhang,
Arun Srinivasan,
Min-ge Xie,
Lingzhou Xue
Abstract:
We introduce a novel meta-analysis framework to combine dependent tests under a general setting, and utilize it to synthesize various microbiome association tests that are calculated from the same dataset. Our development builds upon the classical meta-analysis methods of aggregating $p$-values and also a more recent general method of combining confidence distributions, but makes generalizations to handle dependent tests. The proposed framework ensures rigorous statistical guarantees, and we provide a comprehensive study and compare it with various existing dependent combination methods. Notably, we demonstrate that the widely used Cauchy combination method for dependent tests, referred to as the vanilla Cauchy combination in this article, can be viewed as a special case within our framework. Moreover, the proposed framework provides a way to address the problem when the distributional assumptions underlying the vanilla Cauchy combination are violated. Our numerical results demonstrate that ignoring the dependence among the to-be-combined components may lead to a severe size distortion phenomenon. Compared to the existing $p$-value combination methods, including the vanilla Cauchy combination method, the proposed combination framework can handle the dependence accurately and utilizes the information efficiently to construct tests with accurate size and enhanced power. The development is applied to Microbiome Association Studies, where we aggregate information from multiple existing tests using the same dataset. The combined tests harness the strengths of each individual test across a wide range of alternative spaces, enabling more efficient and meaningful discoveries of vital microbiome associations.
Submitted 14 April, 2024;
originally announced April 2024.
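For readers unfamiliar with the vanilla Cauchy combination mentioned in the abstract above, here is a minimal sketch of its standard form: each $p$-value is mapped to a Cauchy variate via tan((0.5 - p)π), the variates are averaged with weights summing to one, and the combined $p$-value is read off the standard Cauchy tail. This is only the baseline method, not the paper's proposed framework; the function name `cauchy_combination` and the default equal weights are assumptions made for illustration.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the standard (vanilla) Cauchy combination.

    Weights are assumed non-negative and summing to one; equal weights
    are used when none are given. Each p-value must lie strictly in (0, 1).
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if weights is None:
        weights = [1.0 / k] * k
    if len(weights) != k:
        raise ValueError("weights and p_values must have equal length")
    # Weighted sum of Cauchy-transformed p-values.
    statistic = sum(
        w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values)
    )
    # Upper-tail probability of a standard Cauchy at the statistic.
    return 0.5 - math.atan(statistic) / math.pi


combined = cauchy_combination([0.01, 0.20, 0.65])
```

With a single input the transform and its inverse cancel, so the function returns that $p$-value unchanged, which is a convenient sanity check.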
-
A Copula Graphical Model for Multi-Attribute Data using Optimal Transport
Authors:
Qi Zhang,
Bing Li,
Lingzhou Xue
Abstract:
Motivated by modern data forms such as images and multi-view data, the multi-attribute graphical model aims to explore the conditional independence structure among vectors. Under the Gaussian assumption, the conditional independence between vectors is characterized by blockwise zeros in the precision matrix. To relax the restrictive Gaussian assumption, in this paper, we introduce a novel semiparametric multi-attribute graphical model based on a new copula named Cyclically Monotone Copula. This new copula treats the distribution of the node vectors as multivariate marginals and transforms them into Gaussian distributions based on the optimal transport theory. Since the model allows the node vectors to have arbitrary continuous distributions, it is more flexible than the classical Gaussian copula method that performs coordinatewise Gaussianization. We establish the concentration inequalities of the estimated covariance matrices and provide sufficient conditions for selection consistency of the group graphical lasso estimator. For the setting with high-dimensional attributes, a Projected Cyclically Monotone Copula model is proposed to address the curse of dimensionality issue that arises from solving high-dimensional optimal transport problems. Numerical results based on synthetic and real data show the efficiency and flexibility of our methods.
Submitted 10 April, 2024;
originally announced April 2024.
-
Context in Public Health for Underserved Communities: A Bayesian Approach to Online Restless Bandits
Authors:
Biyonka Liang,
Lily Xu,
Aparna Taneja,
Milind Tambe,
Lucas Janson
Abstract:
Public health programs often provide interventions to encourage program adherence, and effectively allocating interventions is vital for producing the greatest overall health outcomes, especially in underserved communities where resources are limited. Such resource allocation problems are often modeled as restless multi-armed bandits (RMABs) with unknown underlying transition dynamics, hence requiring online reinforcement learning (RL). We present Bayesian Learning for Contextual RMABs (BCoR), an online RL approach for RMABs that novelly combines techniques in Bayesian modeling with Thompson sampling to flexibly model the complex RMAB settings present in public health program adherence problems, namely context and non-stationarity. BCoR's key strength is the ability to leverage shared information within and between arms to learn the unknown RMAB transition dynamics quickly in intervention-scarce settings with relatively short time horizons, which are common in public health applications. Empirically, BCoR achieves substantially higher finite-sample performance over a range of experimental settings, including a setting using real-world adherence data that was developed in collaboration with ARMMAN, an NGO in India which runs a large-scale maternal mHealth program, showcasing BCoR's practical utility and potential for real-world deployment.
Submitted 5 February, 2025; v1 submitted 7 February, 2024;
originally announced February 2024.
-
A Penalized Functional Linear Cox Regression Model for Spatially-defined Environmental Exposure with an Estimated Buffer Distance
Authors:
Jooyoung Lee,
Zhibing He,
Charlotte Roscoe,
Peter James,
Li Xu,
Donna Spiegelman,
David Zucker,
Molin Wang
Abstract:
In environmental health research, it is of interest to understand the effect of the neighborhood environment on health. Researchers have shown a protective association between green space around a person's residential address and depression outcomes. In measuring exposure to green space, distance buffers are often used. However, buffer distances differ across studies. Typically, the buffer distance is determined by researchers a priori. It is unclear how to identify an appropriate buffer distance for exposure assessment. To address the geographic uncertainty problem in exposure assessment, we present a domain selection algorithm based on the penalized functional linear Cox regression model. The theoretical properties of our proposed method are studied, and simulation studies are conducted to evaluate the finite-sample performance of our method. The proposed method is illustrated in a study of associations of green space exposure with depression and/or antidepressant use in the Nurses' Health Study.
Submitted 31 December, 2023;
originally announced January 2024.
-
Federated Q-Learning: Linear Regret Speedup with Low Communication Cost
Authors:
Zhong Zheng,
Fengyu Gao,
Lingzhou Xue,
Jing Yang
Abstract:
In this paper, we consider federated reinforcement learning for tabular episodic Markov Decision Processes (MDP) where, under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. While linear speedup in the number of agents has been achieved for some metrics, such as convergence rate and sample complexity, in similar settings, it is unclear whether it is possible to design a model-free algorithm to achieve linear regret speedup with low communication cost. We propose two federated Q-Learning algorithms, termed FedQ-Hoeffding and FedQ-Bernstein, and show that the corresponding total regrets achieve a linear speedup compared with their single-agent counterparts when the time horizon is sufficiently large, while the communication cost scales logarithmically in the total number of time steps $T$. These results rely on an event-triggered synchronization mechanism between the agents and the server, a novel step size selection when the server aggregates the local estimates of the state-action values to form the global estimates, and a set of new concentration inequalities to bound the sum of non-martingale differences. This is the first work showing that linear regret speedup and logarithmic communication cost can be achieved by model-free algorithms in federated reinforcement learning.
Submitted 7 May, 2024; v1 submitted 22 December, 2023;
originally announced December 2023.
-
Bayesian Nonparametric Clustering with Feature Selection for Spatially Resolved Transcriptomics Data
Authors:
Bencong Zhu,
Guanyu Hu,
Yang Xie,
Lin Xu,
Xiaodan Fan,
Qiwei Li
Abstract:
The advent of next-generation sequencing-based spatially resolved transcriptomics (SRT) techniques has reshaped genomic studies by enabling high-throughput gene expression profiling while preserving spatial and morphological context. Nevertheless, there are inherent challenges associated with these new high-dimensional spatial data, such as zero-inflation, over-dispersion, and heterogeneity. These challenges pose obstacles to effective clustering, which is a fundamental problem in SRT data analysis. Current computational approaches often rely on heuristic data preprocessing and arbitrary cluster number prespecification, leading to considerable information loss and consequently, suboptimal downstream analysis. In response to these challenges, we introduce BNPSpace, a novel Bayesian nonparametric spatial clustering framework that directly models SRT count data. BNPSpace facilitates the partitioning of the whole spatial domain, which is characterized by substantial heterogeneity, into homogeneous spatial domains with similar molecular characteristics while identifying a parsimonious set of discriminating genes among different spatial domains. Moreover, BNPSpace incorporates spatial information through a Markov random field prior model, encouraging a smooth and biologically meaningful partition pattern.
Submitted 13 December, 2023;
originally announced December 2023.
-
Deep Neural Network Identification of Limnonectes Species and New Class Detection Using Image Data
Authors:
Li Xu,
Yili Hong,
Eric P. Smith,
David S. McLeod,
Xinwei Deng,
Laura J. Freeman
Abstract:
As is true of many complex tasks, the work of discovering, describing, and understanding the diversity of life on Earth (viz., biological systematics and taxonomy) requires many tools. Some of this work can be accomplished as it has been done in the past, but some aspects present us with challenges which traditional knowledge and tools cannot adequately resolve. One such challenge is presented by species complexes in which the morphological similarities among the group members make it difficult to reliably identify known species and detect new ones. We address this challenge by developing new tools using the principles of machine learning to resolve two specific questions related to species complexes. The first question is formulated as a classification problem in statistics and machine learning and the second question is an out-of-distribution (OOD) detection problem. We apply these tools to a species complex comprising Southeast Asian stream frogs (Limnonectes kuhlii complex) and employ a morphological character (hind limb skin texture) traditionally treated qualitatively in a quantitative and objective manner. We demonstrate that deep neural networks can successfully automate the classification of an image into a known species group for which it has been trained. We further demonstrate that the algorithm can successfully classify an image into a new class if the image does not belong to the existing classes. Additionally, we use the larger MNIST dataset to test the performance of our OOD detection algorithm. We finish our paper with some concluding remarks regarding the application of these methods to species complexes and our efforts to document true biodiversity. This paper has online supplementary materials.
Submitted 14 November, 2023;
originally announced November 2023.
-
The Memory Perturbation Equation: Understanding Model's Sensitivity to Data
Authors:
Peter Nickl,
Lu Xu,
Dharmesh Tailor,
Thomas Möllenhoff,
Mohammad Emtiyaz Khan
Abstract:
Understanding a model's sensitivity to its training data is crucial but can also be challenging and costly, especially during training. To simplify such issues, we present the Memory-Perturbation Equation (MPE), which relates a model's sensitivity to perturbation in its training data. Derived using Bayesian principles, the MPE unifies existing sensitivity measures, generalizes them to a wide variety of models and algorithms, and unravels useful properties regarding sensitivities. Our empirical results show that sensitivity estimates obtained during training can be used to faithfully predict generalization on unseen test data. The proposed equation is expected to be useful for future research on robust and adaptive learning.
Submitted 16 January, 2024; v1 submitted 30 October, 2023;
originally announced October 2023.
-
Nonlinear global Fréchet regression for random objects via weak conditional expectation
Authors:
Satarupa Bhattacharjee,
Bing Li,
Lingzhou Xue
Abstract:
Random objects are complex non-Euclidean data taking values in a general metric space, possibly devoid of any underlying vector space structure. Such data are becoming increasingly abundant with the rapid advancement in technology. Examples include probability distributions, positive semi-definite matrices, and data on Riemannian manifolds. However, except for regression for object-valued response with Euclidean predictors and distribution-on-distribution regression, there has been limited development of a general framework for object-valued response with object-valued predictors in the literature. To fill this gap, we introduce the notion of a weak conditional Fréchet mean based on Carleman operators and then propose a global nonlinear Fréchet regression model through the reproducing kernel Hilbert space (RKHS) embedding. Furthermore, we establish the relationships between the conditional Fréchet mean and the weak conditional Fréchet mean for both Euclidean and object-valued data. We also show that the state-of-the-art global Fréchet regression developed by Petersen and Mueller (2019) emerges as a special case of our method by choosing a linear kernel. We require that the metric space for the predictor admits a reproducing kernel, while the intrinsic geometry of the metric space for the response is utilized to study the asymptotic properties of the proposed estimates. Numerical studies, including extensive simulations and a real application, are conducted to investigate the performance of our estimator in a finite sample.
Submitted 11 October, 2023;
originally announced October 2023.
-
Kernel Single Proxy Control for Deterministic Confounding
Authors:
Liyuan Xu,
Arthur Gretton
Abstract:
We consider the problem of causal effect estimation with an unobserved confounder, where we observe a single proxy variable that is associated with the confounder. Although it has been shown that the recovery of an average causal effect is impossible in general from a single proxy variable, we show that causal recovery is possible if the outcome is generated deterministically. This generalizes existing work on causal methods with a single proxy variable to the continuous treatment setting. We propose two kernel-based methods for this setting: the first based on the two-stage regression approach, and the second based on a maximum moment restriction approach. We prove that both approaches can consistently estimate the causal effect, and we empirically demonstrate that we can successfully recover the causal effect on challenging synthetic benchmarks.
Submitted 18 March, 2025; v1 submitted 8 August, 2023;
originally announced August 2023.
-
Relabeling Minimal Training Subset to Flip a Prediction
Authors:
Jinghan Yang,
Linjie Xu,
Lequan Yu
Abstract:
When facing an unsatisfactory prediction from a machine learning model, users can be interested in investigating the underlying reasons and exploring the potential for reversing the outcome. We ask: To flip the prediction on a test point $x_t$, how can we identify the smallest training subset $\mathcal{S}_t$ that we need to relabel? We propose an efficient algorithm to identify and relabel such a subset via an extended influence function for binary classification models with convex loss. We find that relabeling fewer than 2% of the training points can always flip a prediction. This mechanism can serve multiple purposes: (1) providing an approach to challenge a model prediction by altering training points; (2) evaluating model robustness with the cardinality of the subset (i.e., $|\mathcal{S}_t|$); we show that $|\mathcal{S}_t|$ is highly related to the noise ratio in the training set and $|\mathcal{S}_t|$ is correlated with but complementary to predicted probabilities; and (3) revealing training points that lead to group attribution bias. To the best of our knowledge, we are the first to investigate identifying and relabeling the minimal training subset required to flip a given prediction.
Submitted 3 February, 2024; v1 submitted 22 May, 2023;
originally announced May 2023.
-
A New Inexact Proximal Linear Algorithm with Adaptive Stopping Criteria for Robust Phase Retrieval
Authors:
Zhong Zheng,
Shiqian Ma,
Lingzhou Xue
Abstract:
This paper considers the robust phase retrieval problem, which can be cast as a nonsmooth and nonconvex optimization problem. We propose a new inexact proximal linear algorithm with the subproblem being solved inexactly. Our contributions are two adaptive stopping criteria for the subproblem. The convergence behavior of the proposed methods is analyzed. Through experiments on both synthetic and real datasets, we demonstrate that our methods are much more efficient than existing methods, such as the original proximal linear algorithm and the subgradient method.
Submitted 8 February, 2024; v1 submitted 24 April, 2023;
originally announced April 2023.
-
Generalized functional linear regression models with a mixture of complex function-valued and scalar-valued covariates prone to measurement error
Authors:
Yuanyuan Luan,
Roger S. Zoh,
Sneha Jadhav,
Lan Xue,
Carmen D. Tekwe
Abstract:
While extensive work has been done to correct for biases due to measurement error in scalar-valued covariates prone to errors in generalized linear regression models, limited work has been done to address biases associated with functional covariates prone to errors or the combination of scalar and functional covariates prone to errors in these models. We propose Simulation Extrapolation (SIMEX) and Regression Calibration approaches to correct measurement errors associated with a mixture of functional and scalar covariates prone to classical measurement errors in generalized functional linear regression. The simulation extrapolation method is developed to handle the functional and scalar covariates prone to errors. We also develop methods based on regression calibration extended to our current measurement error settings. Extensive simulation studies are conducted to assess the finite sample performance of our developed methods. The methods are applied to the 2011-2014 cycles of the National Health and Nutrition Examination Survey data to assess the relationship between physical activity and total caloric intake with type 2 diabetes among community-dwelling adults living in the United States. We treat the device-based measures of physical activity as error-prone functional covariates prone to complex arbitrary heteroscedastic errors, while the total caloric intake is considered a scalar-valued covariate prone to error. We also examine the characteristics of observed measurement errors in device-based physical activity by important demographic subgroups including age, sex, and race.
Submitted 12 May, 2023; v1 submitted 4 April, 2023;
originally announced April 2023.
-
A Graphical Point Process Framework for Understanding Removal Effects in Multi-Touch Attribution
Authors:
Jun Tao,
Qian Chen,
James W. Snyder Jr.,
Arava Sai Kumar,
Amirhossein Meisami,
Lingzhou Xue
Abstract:
Marketers employ various online advertising channels to reach customers, and they are particularly interested in attribution for measuring the degree to which individual touchpoints contribute to an eventual conversion. The availability of individual customer-level path-to-purchase data and the increasing number of online marketing channels and types of touchpoints bring new challenges to this fundamental problem. We aim to tackle the attribution problem with finer granularity by conducting attribution at the path level. To this end, we develop a novel graphical point process framework to study the direct conversion effects and the full relational structure among numerous types of touchpoints simultaneously. Utilizing the temporal point process of conversion and the graphical structure, we further propose graphical attribution methods to allocate proper path-level conversion credit, called the attribution score, to individual touchpoints or corresponding channels for each customer's path to purchase. Our proposed attribution methods consider the attribution score as the removal effect, and we use the rigorous probabilistic definition to derive two types of removal effects. We examine the performance of our proposed methods in extensive simulation studies and compare their performance with commonly used attribution models. We also demonstrate the performance of the proposed methods in a real-world attribution application.
Submitted 12 February, 2023;
originally announced February 2023.
-
Theoretical Guarantees for Sparse Principal Component Analysis based on the Elastic Net
Authors:
Teng Zhang,
Haoyi Yang,
Lingzhou Xue
Abstract:
Sparse principal component analysis (SPCA) is widely used for dimensionality reduction and feature extraction in high-dimensional data analysis. Despite many methodological and theoretical developments in the past two decades, the theoretical guarantees of the popular SPCA algorithm proposed by Zou, Hastie & Tibshirani (2006) are still unknown. This paper aims to address this critical gap. We first revisit the SPCA algorithm of Zou et al. (2006) and present our implementation. We also study a computationally more efficient variant of the SPCA algorithm in Zou et al. (2006) that can be considered as the limiting case of SPCA. We provide the guarantees of convergence to a stationary point for both algorithms and prove that, under a sparse spiked covariance model, both algorithms can recover the principal subspace consistently under mild regularity conditions. We show that their estimation error bounds match the best available bounds of existing works or the minimax rates up to some logarithmic factors. Moreover, we demonstrate the competitive numerical performance of both algorithms in numerical studies.
Submitted 27 April, 2023; v1 submitted 29 December, 2022;
originally announced December 2022.
-
Distribution Estimation of Contaminated Data via DNN-based MoM-GANs
Authors:
Fang Xie,
Lihu Xu,
Qiuran Yao,
Huiming Zhang
Abstract:
This paper studies the distribution estimation of contaminated data by the MoM-GAN method, which combines generative adversarial net (GAN) and median-of-mean (MoM) estimation. We use a deep neural network (DNN) with a ReLU activation function to model the generator and discriminator of the GAN. Theoretically, we derive a non-asymptotic error bound for the DNN-based MoM-GAN estimator measured by integral probability metrics with the $b$-smoothness Hölder class. The error bound decreases essentially as $n^{-b/p}\vee n^{-1/2}$, where $n$ and $p$ are the sample size and the dimension of input data. We give an algorithm for the MoM-GAN method and implement it through two real applications. The numerical results show that the MoM-GAN outperforms other competitive methods when dealing with contaminated data.
Submitted 28 December, 2022;
originally announced December 2022.
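The median-of-means idea named in this abstract can be illustrated on a plain scalar mean. The sketch below is not the MoM-GAN algorithm from the paper; it only shows the generic estimator (split the sample into blocks, average each block, take the median of the block means). The function name `median_of_means`, the random block assignment, and the `seed` parameter are assumptions made for illustration.

```python
import random
import statistics


def median_of_means(samples, num_blocks, seed=0):
    """Generic median-of-means estimate of a scalar mean.

    Illustrative only: the block assignment (shuffle, then interleave)
    is an assumption, not the scheme used in the paper.
    """
    if not 1 <= num_blocks <= len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    # Block i receives every num_blocks-th element, so block sizes differ by at most one.
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]
    block_means = [statistics.fmean(block) for block in blocks]
    # A few contaminated blocks cannot move the median of the block means.
    return statistics.median(block_means)


if __name__ == "__main__":
    clean = [1.0] * 99
    contaminated = clean + [1e6]  # a single gross outlier
    print("plain mean:     ", statistics.fmean(contaminated))
    print("median of means:", median_of_means(contaminated, num_blocks=11))
```

With one outlier among 100 points, at most one of the 11 blocks is affected, so the median of the block means stays at the uncontaminated value while the plain mean is pulled far away.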
-
A Neural Mean Embedding Approach for Back-door and Front-door Adjustment
Authors:
Liyuan Xu,
Arthur Gretton
Abstract:
We consider the estimation of average and counterfactual treatment effects, under two settings: back-door adjustment and front-door adjustment. The goal in both cases is to recover the treatment effect without having access to a hidden confounder. This objective is attained by first estimating the conditional mean of the desired outcome variable given relevant covariates (the "first stage" regression), and then taking the (conditional) expectation of this function as a "second stage" procedure. We propose to compute these conditional expectations directly using a regression function to the learned input features of the first stage, thus avoiding the need for sampling or density estimation. All functions and features (and in particular, the output features in the second stage) are neural networks learned adaptively from data, with the sole requirement that the final layer of the first stage should be linear. The proposed method is shown to converge to the true causal parameter, and outperforms the recent state-of-the-art methods on challenging causal benchmarks, including settings involving high-dimensional image data.
Submitted 12 October, 2022;
originally announced October 2022.
-
Artificial Replay: A Meta-Algorithm for Harnessing Historical Data in Bandits
Authors:
Siddhartha Banerjee,
Sean R. Sinclair,
Milind Tambe,
Lily Xu,
Christina Lee Yu
Abstract:
Most real-world deployments of bandit algorithms exist somewhere in between the offline and online set-up, where some historical data is available upfront and additional data is collected dynamically online. How best to incorporate historical data to "warm start" bandit algorithms is an open question: naively initializing reward estimates using all historical samples can suffer from spurious data and imbalanced data coverage, leading to data inefficiency (amount of historical data used) - particularly for continuous action spaces. To address these challenges, we propose ArtificialReplay, a meta-algorithm for incorporating historical data into any arbitrary base bandit algorithm. We show that ArtificialReplay uses only a fraction of the historical data compared to a full warm-start approach, while still achieving identical regret for base algorithms that satisfy independence of irrelevant data (IIData), a novel and broadly applicable property that we introduce. We complement these theoretical results with experiments on K-armed bandits and continuous combinatorial bandits, on which we model green security domains using real poaching data. Our results show the practical benefits of ArtificialReplay for improving data efficiency, including for base algorithms that do not satisfy IIData.
Submitted 19 March, 2025; v1 submitted 30 September, 2022;
originally announced October 2022.
-
Hypothesis Testing for Detecting Outlier Evaluators
Authors:
Li Xu,
Molin Wang
Abstract:
In epidemiological studies, very often, evaluators obtain measurements of disease outcomes for study participants. In this paper, we propose a two-stage procedure for detecting outlier evaluators. In the first stage, a regression model is fitted to obtain the evaluators' effects. The outlier evaluators are considered as those with different effects compared with the normal evaluators. In the second stage, stepwise hypothesis tests are performed to detect outlier evaluators. The true positive rate and true negative rate of the proposed procedure are assessed in a simulation study. We apply the proposed method to detect potential outlier audiologists among the audiologists who measured hearing threshold levels of the participants in the Audiology Assessment Arm of the Conservation of Hearing Study, which is an epidemiological study for examining risk factors of hearing loss.
Submitted 27 September, 2022;
originally announced September 2022.
-
Nonlinear Sufficient Dimension Reduction for Distribution-on-Distribution Regression
Authors:
Qi Zhang,
Bing Li,
Lingzhou Xue
Abstract:
We introduce a new approach to nonlinear sufficient dimension reduction in cases where both the predictor and the response are distributional data, modeled as members of a metric space. Our key step is to build universal kernels (cc-universal) on the metric spaces, which results in reproducing kernel Hilbert spaces for the predictor and response that are rich enough to characterize the conditional independence that determines sufficient dimension reduction. For univariate distributions, we construct the universal kernel using the Wasserstein distance, while for multivariate distributions, we resort to the sliced Wasserstein distance. The sliced Wasserstein distance ensures that the metric space possesses similar topological properties to the Wasserstein space while also offering significant computation benefits. Numerical results based on synthetic data show that our method outperforms possible competing methods. The method is also applied to several data sets, including fertility and mortality data and Calgary temperature data.
Submitted 24 April, 2023; v1 submitted 11 July, 2022;
originally announced July 2022.
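For univariate distributions, the 2-Wasserstein distance equals the L2 distance between quantile functions, which makes a kernel on distributions easy to sketch. The code below is a hedged illustration and not the authors' construction: the Gaussian-type form exp(-gamma * W2^2), the quantile grid, and the names `wasserstein2_1d` and `wasserstein_gaussian_kernel` are all assumptions.

```python
import numpy as np


def wasserstein2_1d(x, y, grid_size=200):
    """Approximate W2 between the empirical distributions of 1D samples x and y.

    Uses the quantile-function representation of W2 in one dimension,
    evaluated on a midpoint grid of probability levels (an assumption).
    """
    probs = (np.arange(grid_size) + 0.5) / grid_size
    qx = np.quantile(x, probs)
    qy = np.quantile(y, probs)
    return float(np.sqrt(np.mean((qx - qy) ** 2)))


def wasserstein_gaussian_kernel(samples, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * W2(samples[i], samples[j])**2).

    The Gaussian-type form is an illustrative choice, not taken from the paper.
    """
    n = len(samples)
    gram = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            dist = wasserstein2_1d(samples[i], samples[j])
            gram[i, j] = gram[j, i] = np.exp(-gamma * dist ** 2)
    return gram


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=500)
    # Shifting a sample by c moves every quantile by c, so W2 equals |c|.
    print(wasserstein2_1d(base, base + 2.0))
    print(wasserstein_gaussian_kernel([base, base + 1.0, base + 2.0], gamma=0.5))
```

Because each distribution is represented by its vector of quantiles, this Gram matrix is an ordinary Gaussian kernel on those vectors and is therefore positive semidefinite.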
-
Prediction for Distributional Outcomes in High-Performance Computing I/O Variability
Authors:
Li Xu,
Yili Hong,
Max D. Morris,
Kirk W. Cameron
Abstract:
Although high-performance computing (HPC) systems have been scaled to meet the exponentially-growing demand for scientific computing, HPC performance variability remains a major challenge and has become a critical research topic in computer science. Statistically, performance variability can be characterized by a distribution. Predicting performance variability is a critical step in HPC performance variability management and is nontrivial because one needs to predict a distribution function based on system factors. In this paper, we propose a new framework to predict performance distributions. The proposed model is a modified Gaussian process that can predict the distribution function of the input/output (I/O) throughput under a specific HPC system configuration. We also impose a monotonic constraint so that the predicted function is nondecreasing, which is a property of the cumulative distribution function. Additionally, the proposed model can incorporate both quantitative and qualitative input variables. We evaluate the performance of the proposed method by using the IOzone variability data based on various prediction tasks. Results show that the proposed method can generate accurate predictions, and outperform existing methods. We also show how the predicted functional output can be used to generate predictions for a scalar summary of the performance distribution, such as the mean, standard deviation, and quantiles. Our methods can be further used as a surrogate model for HPC system variability monitoring and optimization.
Submitted 19 May, 2022;
originally announced May 2022.
-
Validating Causal Inference Methods
Authors:
Harsh Parikh,
Carlos Varjao,
Louise Xu,
Eric Tchetgen Tchetgen
Abstract:
The fundamental challenge of drawing causal inference is that counterfactual outcomes are not fully observed for any unit. Furthermore, in observational studies, treatment assignment is likely to be confounded. Many statistical methods have emerged for causal inference under unconfoundedness conditions given pre-treatment covariates, including propensity score-based methods, prognostic score-based methods, and doubly robust methods. Unfortunately for applied researchers, there is no 'one-size-fits-all' causal method that performs optimally in every setting. In practice, causal methods are primarily evaluated quantitatively on handcrafted simulated data. Such data-generative procedures can be of limited value because they are typically stylized models of reality. They are simplified for tractability and lack the complexities of real-world data. For applied researchers, it is critical to understand how well a method performs for the data at hand. Our work introduces a deep generative model-based framework, Credence, to validate causal inference methods. The framework's novelty stems from its ability to generate synthetic data anchored at the empirical distribution for the observed sample, and therefore virtually indistinguishable from the latter. The approach allows the user to specify ground truth for the form and magnitude of causal effects and confounding bias as functions of covariates. Thus simulated data sets are used to evaluate the potential performance of various causal estimation methods when applied to data similar to the observed sample. We demonstrate Credence's ability to accurately assess the relative performance of causal estimation techniques in an extensive simulation study and two real-world data applications from the Lalonde and Project STAR studies.
Submitted 29 July, 2022; v1 submitted 8 February, 2022;
originally announced February 2022.
-
Importance Weighting Approach in Kernel Bayes' Rule
Authors:
Liyuan Xu,
Yutian Chen,
Arnaud Doucet,
Arthur Gretton
Abstract:
We study a nonparametric approach to Bayesian computation via feature means, where the expectation of prior features is updated to yield expected kernel posterior features, based on regression from learned neural net or kernel features of the observations. All quantities involved in the Bayesian update are learned from observed data, making the method entirely model-free. The resulting algorithm is a novel instance of a kernel Bayes' rule (KBR), based on importance weighting. This results in superior numerical stability to the original approach to KBR, which requires operator inversion. We show the convergence of the estimator using a novel consistency analysis on the importance weighting estimator in the infinity norm. We evaluate KBR on challenging synthetic benchmarks, including a filtering problem with a state-space model involving high dimensional image observations. Importance weighted KBR yields uniformly better empirical performance than the original KBR, and competitive performance with other competing methods.
Submitted 10 August, 2022; v1 submitted 4 February, 2022;
originally announced February 2022.
-
Design Strategies and Approximation Methods for High-Performance Computing Variability Management
Authors:
Yueyao Wang,
Li Xu,
Yili Hong,
Rong Pan,
Tyler Chang,
Thomas Lux,
Jon Bernard,
Layne Watson,
Kirk Cameron
Abstract:
Performance variability management is an active research area in high-performance computing (HPC). We focus on input/output (I/O) variability. To study the performance variability, computer scientists often use grid-based designs (GBDs) to collect I/O variability data, and use mathematical approximation methods to build a prediction model. Mathematical approximation models could be biased particularly if extrapolations are needed. Space-filling designs (SFDs) and surrogate models such as Gaussian process (GP) are popular for data collection and building predictive models. The applicability of SFDs and surrogates in the HPC variability needs investigation. We investigate their applicability in the HPC setting in terms of design efficiency, prediction accuracy, and scalability. We first customize the existing SFDs so that they can be applied in the HPC setting. We conduct a comprehensive investigation of design strategies and the prediction ability of approximation methods. We use both synthetic data simulated from three test functions and the real data from the HPC setting. We then compare different methods in terms of design efficiency, prediction accuracy, and scalability. In synthetic and real data analysis, GP with SFDs outperforms in most scenarios. With respect to approximation models, GP is recommended if the data are collected by SFDs. If data are collected using GBDs, both GP and Delaunay can be considered. With the best choice of approximation method, the performance of SFDs and GBD depends on the property of the underlying surface. For the cases in which SFDs perform better, the number of design points needed for SFDs is about half of or less than that of the GBD to achieve the same prediction accuracy. SFDs that can be tailored to high dimension and non-smooth surface are recommended especially when large numbers of input factors need to be considered in the model.
Submitted 24 January, 2022;
originally announced January 2022.
-
Non-Asymptotic Guarantees for Robust Statistical Learning under Infinite Variance Assumption
Authors:
Lihu Xu,
Fang Yao,
Qiuran Yao,
Huiming Zhang
Abstract:
There has been a surge of interest in developing robust estimators for models with heavy-tailed and bounded variance data in statistics and machine learning, while few works impose unbounded variance. This paper proposes two types of robust estimators, the ridge log-truncated M-estimator and the elastic net log-truncated M-estimator. The first estimator is applied to convex regressions such as quantile regression and generalized linear models, while the other one is applied to high dimensional non-convex learning problems such as regressions via deep neural networks. Simulations and real data analysis demonstrate the robustness of log-truncated estimations over standard estimations.
Submitted 11 October, 2022; v1 submitted 10 January, 2022;
originally announced January 2022.
-
An additive graphical model for discrete data
Authors:
Jun Tao,
Bing Li,
Lingzhou Xue
Abstract:
We introduce a nonparametric graphical model for discrete node variables based on additive conditional independence. Additive conditional independence is a three way statistical relation that shares similar properties with conditional independence by satisfying the semi-graphoid axioms. Based on this relation we build an additive graphical model for discrete variables that does not suffer from the restriction of a parametric model such as the Ising model. We develop an estimator of the new graphical model via the penalized estimation of the discrete version of the additive precision operator and establish the consistency of the estimator under the ultrahigh-dimensional setting. Along with these methodological developments, we also exploit the properties of discrete random variables to uncover a deeper relation between additive conditional independence and conditional independence than previously known. The new graphical model reduces to a conditional independence graphical model under certain sparsity conditions. We conduct simulation experiments and analysis of an HIV antiretroviral therapy data set to compare the new method with existing ones.
Submitted 29 December, 2021;
originally announced December 2021.
-
Statistical Perspectives on Reliability of Artificial Intelligence Systems
Authors:
Yili Hong,
Jiayi Lian,
Li Xu,
Jie Min,
Yueyao Wang,
Laura J. Freeman,
Xinwei Deng
Abstract:
Artificial intelligence (AI) systems have become increasingly popular in many areas. Nevertheless, AI technologies are still in their developing stages, and many issues need to be addressed. Among those, the reliability of AI systems needs to be demonstrated so that the AI systems can be used with confidence by the general public. In this paper, we provide statistical perspectives on the reliability of AI systems. Different from other considerations, the reliability of AI systems focuses on the time dimension. That is, the system can perform its designed functionality for the intended period. We introduce a so-called SMART statistical framework for AI reliability research, which includes five components: Structure of the system, Metrics of reliability, Analysis of failure causes, Reliability assessment, and Test planning. We review traditional methods in reliability data analysis and software reliability, and discuss how those existing methods can be transformed for reliability modeling and assessment of AI systems. We also describe recent developments in modeling and analysis of AI reliability and outline statistical research challenges in this area, including out-of-distribution detection, the effect of the training set, adversarial attacks, model accuracy, and uncertainty quantification, and discuss how those topics can be related to AI reliability, with illustrative examples. Finally, we discuss data collection and test planning for AI reliability assessment and how to improve system designs for higher AI reliability. The paper closes with some concluding remarks.
△ Less
Submitted 9 November, 2021;
originally announced November 2021.
-
Sequential Kernel Embedding for Mediated and Time-Varying Dose Response Curves
Authors:
Rahul Singh,
Liyuan Xu,
Arthur Gretton
Abstract:
We propose simple nonparametric estimators for mediated and time-varying dose response curves based on kernel ridge regression. By embedding Pearl's mediation formula and Robins' g-formula with kernels, we allow treatments, mediators, and covariates to be continuous in general spaces, and also allow for nonlinear treatment-confounder feedback. Our key innovation is a reproducing kernel Hilbert space technique called sequential kernel embedding, which we use to construct simple estimators that account for complex feedback. Our estimators preserve the generality of classic identification while also achieving nonasymptotic uniform rates. In nonlinear simulations with many covariates, we demonstrate strong performance. We estimate mediated and time-varying dose response curves of the US Job Corps, and clean data that may serve as a benchmark in future work. We extend our results to mediated and time-varying treatment effects and counterfactual distributions, verifying semiparametric efficiency and weak convergence.
Submitted 16 March, 2025; v1 submitted 6 November, 2021;
originally announced November 2021.
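The estimators in this abstract are built on kernel ridge regression. As a reminder of that building block only, here is a minimal sketch of plain kernel ridge regression with a Gaussian kernel; it does not implement sequential kernel embedding, the mediation formula, or the g-formula. The function names, the `lengthscale` parameter, and the `n * lam` scaling of the ridge penalty are assumptions for illustration.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale ** 2))


def krr_fit(X, y, lam=1e-3, lengthscale=1.0):
    """Solve (K + n * lam * I) alpha = y for the dual coefficients alpha."""
    n = X.shape[0]
    K = rbf_kernel(X, X, lengthscale)
    return np.linalg.solve(K + n * lam * np.eye(n), y)


def krr_predict(X_train, alpha, X_new, lengthscale=1.0):
    """Predict at X_new as k(X_new, X_train) @ alpha."""
    return rbf_kernel(X_new, X_train, lengthscale) @ alpha


if __name__ == "__main__":
    X = np.linspace(0.0, 3.0, 50)[:, None]
    y = np.sin(X[:, 0])
    alpha = krr_fit(X, y, lam=1e-6)
    X_new = np.array([[0.5], [1.5], [2.5]])
    print(krr_predict(X, alpha, X_new))
    print(np.sin(X_new[:, 0]))
```

The closed-form solve is the reason estimators of this kind are described as simple: fitting reduces to one linear system in the Gram matrix.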
-
Dimension Reduction for Fréchet Regression
Authors:
Qi Zhang,
Lingzhou Xue,
Bing Li
Abstract:
With the rapid development of data collection techniques, complex data objects that are not in the Euclidean space are frequently encountered in new statistical applications. Fréchet regression model (Peterson & Müller 2019) provides a promising framework for regression analysis with metric space-valued responses. In this paper, we introduce a flexible sufficient dimension reduction (SDR) method for Fréchet regression to achieve two purposes: to mitigate the curse of dimensionality caused by high-dimensional predictors and to provide a visual inspection tool for Fréchet regression. Our approach is flexible enough to turn any existing SDR method for Euclidean (X,Y) into one for Euclidean X and metric space-valued Y. The basic idea is to first map the metric-space valued random object $Y$ to a real-valued random variable $f(Y)$ using a class of functions, and then perform classical SDR to the transformed data. If the class of functions is sufficiently rich, then we are guaranteed to uncover the Fréchet SDR space. We showed that such a class, which we call an ensemble, can be generated by a universal kernel. We established the consistency and asymptotic convergence rate of the proposed methods. The finite-sample performance of the proposed methods is illustrated through simulation studies for several commonly encountered metric spaces that include Wasserstein space, the space of symmetric positive definite matrices, and the sphere. We illustrated the data visualization aspect of our method by exploring the human mortality distribution data across countries and by studying the distribution of hematoma density.
Submitted 6 December, 2022; v1 submitted 1 October, 2021;
originally announced October 2021.