
Inverse problem for fractional Schrödinger equations with drift on closed Riemannian manifolds

Abstract: This paper is concerned with the inverse coefficient problems of variable-coefficient fractional Schrödinger equations with drift on connected closed Riemannian manifolds. We prove that the knowledge of the underlying equation on any non-empty open subset of the underlying manifold determines the Riemannian metric, the drift and the potential, simultaneously and uniquely, up to a gauge transformation. This paper extends the result in \cite{feizmohammadi2021fractionalanisotropiccalderonproblem} for principal terms. Not only can we retrieve lower order terms, but we are also able to achieve the simultaneous inversion of all terms. The key ingredient is a novel Runge approximation of fractional PDEs on Riemannian manifolds.

Submitted 30 August, 2025; v1 submitted 22 August, 2025; originally announced August 2025.

Comments: There is a gap in the proof

arXiv:2507.09388 [pdf, ps, other]

Optimal Differentially Private Ranking from Pairwise Comparisons

Authors: T. Tony Cai, Abhinav Chakraborty, Yichen Wang

Abstract: Data privacy is a central concern in many applications involving ranking from incomplete and noisy pairwise comparisons, such as recommendation systems, educational assessments, and opinion surveys on sensitive topics. In this work, we propose differentially private algorithms for ranking based on pairwise comparisons. Specifically, we develop and analyze ranking methods under two privacy notions: edge differential privacy, which protects the confidentiality of individual comparison outcomes, and individual differential privacy, which safeguards potentially many comparisons contributed by a single individual. Our algorithms--including a perturbed maximum likelihood estimator and a noisy count-based method--are shown to achieve minimax optimal rates of convergence under the respective privacy constraints. We further demonstrate the practical effectiveness of our methods through experiments on both simulated and real-world data.

Submitted 12 July, 2025; originally announced July 2025.

arXiv:2504.08084 [pdf, other]

Generalized torsion in amalgams

Authors: Tommy Wuxing Cai, Adam Clay

Abstract: We give a condition sufficient to ensure that an amalgam of groups is generalized torsion-free. As applications, we construct a closed 3-manifold whose fundamental group is generalized torsion-free and non bi-orderable; a one-relator group which is generalized torsion-free and non bi-orderable; and a group which is generalized torsion-free and non left-orderable.

Submitted 10 April, 2025; originally announced April 2025.

Comments: 50 pages, 1 figure

MSC Class: 05E16; 06F15; 20F60; 57M05

arXiv:2412.18992 [pdf, other]

Optimal Federated Learning for Functional Mean Estimation under Heterogeneous Privacy Constraints

Authors: Tony Cai, Abhinav Chakraborty, Lasse Vuursteen

Abstract: Federated learning (FL) is a distributed machine learning technique designed to preserve data privacy and security, and it has gained significant importance due to its broad range of applications. This paper addresses the problem of optimal functional mean estimation from discretely sampled data in a federated setting. We consider a heterogeneous framework where the number of individuals, measurements per individual, and privacy parameters vary across one or more servers, under both common and independent design settings. In the common design setting, the same design points are measured for each individual, whereas in the independent design, each individual has their own random collection of design points. Within this framework, we establish minimax upper and lower bounds for the estimation error of the underlying mean function, highlighting the nuanced differences between common and independent designs under distributed privacy constraints. We propose algorithms that achieve the optimal trade-off between privacy and accuracy and provide optimality results that quantify the fundamental limits of private functional mean estimation across diverse distributed settings. These results characterize the cost of privacy and offer practical insights into the potential for privacy-preserving statistical analysis in federated environments.

Submitted 15 January, 2025; v1 submitted 25 December, 2024; originally announced December 2024.

Comments: 54 pages: 25 page article and 29 pages of appendix

MSC Class: 62G08; 62C20; 68P27; 62F30

arXiv:2411.15660 [pdf, other]

Federated PCA and Estimation for Spiked Covariance Matrices: Optimal Rates and Efficient Algorithm

Authors: Jingyang Li, T. Tony Cai, Dong Xia, Anru R. Zhang

Abstract: Federated Learning (FL) has gained significant recent attention in machine learning for its enhanced privacy and data security, making it indispensable in fields such as healthcare, finance, and personalized services. This paper investigates federated PCA and estimation for spiked covariance matrices under distributed differential privacy constraints. We establish minimax rates of convergence, with a key finding that the central server's optimal rate is the harmonic mean of the local clients' minimax rates. This guarantees consistent estimation at the central server as long as at least one local client provides consistent results. Notably, consistency is maintained even if some local estimators are inconsistent, provided there are enough clients. These findings highlight the robustness and scalability of FL for reliable statistical inference under privacy constraints. To establish minimax lower bounds, we derive a matrix version of van Trees' inequality, which is of independent interest. Furthermore, we propose an efficient algorithm that preserves differential privacy while achieving near-optimal rates at the central server, up to a logarithmic factor. We address significant technical challenges in analyzing this algorithm, which involves a three-layer spectral decomposition. Numerical performance of the proposed algorithm is investigated using both simulated and real data.

Submitted 23 November, 2024; originally announced November 2024.

arXiv:2410.07454 [pdf, other]

Representation-Enhanced Neural Knowledge Integration with Application to Large-Scale Medical Ontology Learning

Authors: Suqi Liu, Tianxi Cai, Xiaoou Li

Abstract: A large-scale knowledge graph enhances reproducibility in biomedical data discovery by providing a standardized, integrated framework that ensures consistent interpretation across diverse datasets. It improves generalizability by connecting data from various sources, enabling broader applicability of findings across different populations and conditions. Generating a reliable knowledge graph that leverages multi-source information from existing literature, however, is challenging, especially with a large number of nodes and heterogeneous relations. In this paper, we propose a general theoretically guaranteed statistical framework, called RENKI, to enable simultaneous learning of multiple relation types. RENKI generalizes various network models widely used in statistics and computer science. The proposed framework incorporates representation learning output into the initial entity embedding of a neural network that approximates the score function for the knowledge graph and continuously trains the model to fit observed facts. We prove nonasymptotic bounds for in-sample and out-of-sample weighted MSEs in relation to the pseudo-dimension of the knowledge graph function class. Additionally, we provide pseudo-dimensions for score functions based on multilayer neural networks with ReLU activation function, in the scenarios when the embedding parameters are either fixed or trainable.
Finally, we complement our theoretical results with numerical studies and apply the method to learn a comprehensive medical knowledge graph combining a pretrained language model representation with knowledge graph links observed in several medical ontologies. The experiments justify our theoretical findings and demonstrate the effect of weighting in the presence of heterogeneous relations and the benefit of incorporating representation learning in nonparametric models.

Submitted 9 October, 2024; originally announced October 2024.

arXiv:2406.20088 [pdf, other]

Minimax And Adaptive Transfer Learning for Nonparametric Classification under Distributed Differential Privacy Constraints

Authors: Arnab Auddy, T. Tony Cai, Abhinav Chakraborty

Abstract: This paper considers minimax and adaptive transfer learning for nonparametric classification under the posterior drift model with distributed differential privacy constraints. Our study is conducted within a heterogeneous framework, encompassing diverse sample sizes, varying privacy parameters, and data heterogeneity across different servers. We first establish the minimax misclassification rate, precisely characterizing the effects of privacy constraints, source samples, and target samples on classification accuracy. The results reveal interesting phase transition phenomena and highlight the intricate trade-offs between preserving privacy and achieving classification accuracy. We then develop a data-driven adaptive classifier that achieves the optimal rate within a logarithmic factor across a large collection of parameter spaces while satisfying the same set of differential privacy constraints. Simulation studies and real-world data applications further elucidate the theoretical analysis with numerical results.

Submitted 28 June, 2024; originally announced June 2024.

MSC Class: 62G08; 62G20

arXiv:2406.18876 [pdf, other]

Ordered bases, order-preserving automorphisms and bi-orderable link groups

Authors: Tommy Wuxing Cai, Adam Clay, Dale Rolfsen

Abstract: We give a new criterion which guarantees that a free group admits a bi-ordering that is invariant under a given automorphism. As an application, we show that the fundamental group of the "magic manifold" is bi-orderable, answering a question of Kin and Rolfsen.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 19 pages, 2 figures

MSC Class: 06F15; 20F60; 57M05; 57K30

arXiv:2406.06755 [pdf, other]

Optimal Federated Learning for Nonparametric Regression with Heterogeneous Distributed Differential Privacy Constraints

Authors: T. Tony Cai, Abhinav Chakraborty, Lasse Vuursteen

Abstract: This paper studies federated learning for nonparametric regression in the context of distributed samples across different servers, each adhering to distinct differential privacy constraints. The setting we consider is heterogeneous, encompassing both varying sample sizes and differential privacy constraints across servers. Within this framework, both global and pointwise estimation are considered, and optimal rates of convergence over the Besov spaces are established. Distributed privacy-preserving estimators are proposed and their risk properties are investigated. Matching minimax lower bounds, up to a logarithmic factor, are established for both global and pointwise estimation. Together, these findings shed light on the tradeoff between statistical accuracy and privacy preservation. In particular, we characterize the compromise not only in terms of the privacy budget but also concerning the loss incurred by distributing data within the privacy framework as a whole. This insight captures the folklore wisdom that it is easier to retain privacy in larger samples, and explores the differences between pointwise and global estimation under distributed privacy constraints.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 49 pages total, consisting of an article (24 pages) and a supplement (25 pages)

MSC Class: 62G08; 62C20; 68P27; 62F30

arXiv:2406.06749 [pdf, other]

Federated Nonparametric Hypothesis Testing with Differential Privacy Constraints: Optimal Rates and Adaptive Tests

Authors: T. Tony Cai, Abhinav Chakraborty, Lasse Vuursteen

Abstract: Federated learning has attracted significant recent attention due to its applicability across a wide range of settings where data is collected and analyzed across disparate locations. In this paper, we study federated nonparametric goodness-of-fit testing in the white-noise-with-drift model under distributed differential privacy (DP) constraints. We first establish matching lower and upper bounds, up to a logarithmic factor, on the minimax separation rate. This optimal rate serves as a benchmark for the difficulty of the testing problem, factoring in model characteristics such as the number of observations, noise level, and regularity of the signal class, along with the strictness of the $(ε,δ)$-DP requirement. The results demonstrate interesting and novel phase transition phenomena. Furthermore, the results reveal an interesting phenomenon that distributed one-shot protocols with access to shared randomness outperform those without access to shared randomness. We also construct a data-driven testing procedure that possesses the ability to adapt to an unknown regularity parameter over a large collection of function classes with minimal additional cost, all while maintaining adherence to the same set of DP constraints.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 77 pages total; consisting of a main article (28 pages) and supplement (49 pages)

MSC Class: 62G10; 62C20; 68P27; 62F30

arXiv:2403.19410 [pdf, ps, other]

On the Exact Fourier Dimension of Sets of Well-Approximable Matrices

Authors: Thomas Cai, Kyle Hambrook

Abstract: We compute the exact Fourier dimension of the set of $Ψ$-well-approximable $m \times n$ matrices (and the set of $Ψ$-well-approximable numbers) in the homogeneous and inhomogeneous cases for any approximation function $Ψ$ satisfying $\sum_{q \in \mathbb{Z}^n} Ψ(q)^m < \infty$.

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2401.12331 [pdf, other]

Transfer Learning for Functional Mean Estimation: Phase Transition and Adaptive Algorithms

Authors: T. Tony Cai, Dongwoo Kim, Hongming Pu

Abstract: This paper studies transfer learning for estimating the mean of random functions based on discretely sampled data, where, in addition to observations from the target distribution, auxiliary samples from similar but distinct source distributions are available. The paper considers both common and independent designs and establishes the minimax rates of convergence for both designs. The results reveal an interesting phase transition phenomenon under the two designs and demonstrate the benefits of utilizing the source samples in the low sampling frequency regime. For practical applications, this paper proposes novel data-driven adaptive algorithms that attain the optimal rates of convergence within a logarithmic factor simultaneously over a large collection of parameter spaces. The theoretical findings are complemented by a simulation study that further supports the effectiveness of the proposed algorithms.

Submitted 27 March, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

MSC Class: Primary 62J05; secondary 62G20

arXiv:2401.03820 [pdf, other]

Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices

Authors: T. Tony Cai, Dong Xia, Mengyue Zha

Abstract: Estimating a covariance matrix and its associated principal components is a fundamental problem in contemporary statistics. While optimal estimation procedures have been developed with well-understood properties, the increasing demand for privacy preservation introduces new complexities to this classical problem. In this paper, we study optimal differentially private Principal Component Analysis (PCA) and covariance estimation within the spiked covariance model. We precisely characterize the sensitivity of eigenvalues and eigenvectors under this model and establish the minimax rates of convergence for estimating both the principal components and covariance matrix. These rates hold up to logarithmic factors and encompass general Schatten norms, including spectral norm, Frobenius norm, and nuclear norm as special cases. We propose computationally efficient differentially private estimators and prove their minimax optimality for sub-Gaussian distributions, up to logarithmic factors. Additionally, matching minimax lower bounds are established. Notably, compared to the existing literature, our results accommodate a diverging rank, a broader range of signal strengths, and remain valid even when the sample size is much smaller than the dimension, provided the signal strength is sufficiently strong. Both simulation studies and real data experiments demonstrate the merits of our method.

Submitted 27 September, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

arXiv:2305.19997 [pdf, other]

Knowledge Graph Embedding with Electronic Health Records Data via Latent Graphical Block Model

Authors: Junwei Lu, Jin Yin, Tianxi Cai

Abstract: Due to the increasing adoption of electronic health records (EHR), large scale EHRs have become another rich data source for translational clinical research. Despite its potential, deriving generalizable knowledge from EHR data remains challenging. First, EHR data are generated as part of clinical care with data elements too detailed and fragmented for research. Despite recent progress in mapping EHR data to common ontology with hierarchical structures, much development is still needed to enable automatic grouping of local EHR codes to meaningful clinical concepts at a large scale. Second, the total number of unique EHR features is large, imposing methodological challenges to deriving a reproducible knowledge graph, especially when interest lies in conditional dependency structure. Third, the detailed EHR data on a very large patient cohort imposes additional computational challenges to deriving a knowledge network. To overcome these challenges, we propose to infer the conditional dependency structure among EHR features via a latent graphical block model (LGBM). The LGBM has a two-layer structure with the first providing semantic embedding vector (SEV) representation for the EHR features and the second overlaying a graphical block model on the latent SEVs. The block structures on the graphical model also allow us to cluster synonymous features in EHR. We propose to learn the LGBM efficiently, in both statistical and computational sense, based on the empirical pointwise mutual information matrix.
We establish the statistical rates of the proposed estimators and show the perfect recovery of the block structure. Numerical results from simulation studies and real EHR data analyses suggest that the proposed LGBM estimator performs well in finite samples.

Submitted 31 May, 2023; originally announced May 2023.

arXiv:2305.17608 [pdf, other]

Reward Collapse in Aligning Large Language Models

Authors: Ziang Song, Tianle Cai, Jason D. Lee, Weijie J. Su

Abstract: The extraordinary capabilities of large language models (LLMs) such as ChatGPT and GPT-4 are in part unleashed by aligning them with reward models that are trained on human preferences, which are often represented as rankings of responses to prompts. In this paper, we document the phenomenon of \textit{reward collapse}, an empirical observation where the prevailing ranking-based approach results in an \textit{identical} reward distribution \textit{regardless} of the prompts during the terminal phase of training. This outcome is undesirable as open-ended prompts like ``write a short story about your best friend'' should yield a continuous range of rewards for their completions, while specific prompts like ``what is the capital of New Zealand'' should generate either high or low rewards. Our theoretical investigation reveals that reward collapse is primarily due to the insufficiency of the ranking-based objective function to incorporate prompt-related information during optimization. This insight allows us to derive closed-form expressions for the reward distribution associated with a set of utility functions in an asymptotic regime. To overcome reward collapse, we introduce a prompt-aware optimization scheme that provably admits a prompt-dependent reward distribution within the interpolating regime. Our experimental results suggest that our proposed prompt-aware utility functions significantly alleviate reward collapse during the training of reward models.

Submitted 27 May, 2023; originally announced May 2023.

arXiv:2305.00164 [pdf, other]

doi 10.1214/24-AOS2355

Estimation and inference for minimizer and minimum of convex functions: optimality, adaptivity and uncertainty principles

Authors: T. Tony Cai, Ran Chen, Yuancheng Zhu

Abstract: Optimal estimation and inference for both the minimizer and minimum of a convex regression function under the white noise and nonparametric regression models are studied in a nonasymptotic local minimax framework, where the performance of a procedure is evaluated at individual functions. Fully adaptive and computationally efficient algorithms are proposed and sharp minimax lower bounds are given for both the estimation accuracy and expected length of confidence intervals for the minimizer and minimum. The nonasymptotic local minimax framework brings out new phenomena in simultaneous estimation and inference for the minimizer and minimum. We establish a novel uncertainty principle that provides a fundamental limit on how well the minimizer and minimum can be estimated simultaneously for any convex regression function. A similar result holds for the expected length of the confidence intervals for the minimizer and minimum.

Submitted 9 March, 2024; v1 submitted 29 April, 2023; originally announced May 2023.

Journal ref: Ann. Statist. 52(1): 392-411 (February 2024)

arXiv:2303.07152 [pdf, ps, other]

Score Attack: A Lower Bound Technique for Optimal Differentially Private Learning

Authors: T. Tony Cai, Yichen Wang, Linjun Zhang

Abstract: Achieving optimal statistical performance while ensuring the privacy of personal data is a challenging yet crucial objective in modern data analysis. However, characterizing the optimality, particularly the minimax lower bound, under privacy constraints is technically difficult. To address this issue, we propose a novel approach called the score attack, which provides a lower bound on the differential-privacy-constrained minimax risk of parameter estimation. The score attack method is based on the tracing attack concept in differential privacy and can be applied to any statistical model with a well-defined score statistic. It can optimally lower bound the minimax risk of estimating unknown model parameters, up to a logarithmic factor, while ensuring differential privacy for a range of statistical problems. We demonstrate the effectiveness and optimality of this general method in various examples, such as the generalized linear model in both classical and high-dimensional sparse settings, the Bradley-Terry-Luce model for pairwise comparisons, and non-parametric regression over the Sobolev class.

Submitted 12 July, 2025; v1 submitted 13 March, 2023; originally announced March 2023.

arXiv:2301.10392 [pdf, other]

Statistical Inference and Large-scale Multiple Testing for High-dimensional Regression Models

Authors: T. Tony Cai, Zijian Guo, Yin Xia

Abstract: This paper presents a selective survey of recent developments in statistical inference and multiple testing for high-dimensional regression models, including linear and logistic regression. We examine the construction of confidence intervals and hypothesis tests for various low-dimensional objectives such as regression coefficients and linear and quadratic functionals. The key technique is to generate debiased and desparsified estimators for the targeted low-dimensional objectives and estimate their uncertainty. In addition to covering the motivations for and intuitions behind these statistical methods, we also discuss their optimality and adaptivity in the context of high-dimensional inference. In addition, we review the recent development of statistical inference based on multiple regression models and the advancement of large-scale multiple testing for high-dimensional regression. The R package SIHR has implemented some of the high-dimensional inference methods discussed in this paper.

Submitted 24 January, 2023; originally announced January 2023.

arXiv:2301.01381 [pdf, other]

Testing High-dimensional Multinomials with Applications to Text Analysis

Authors: T. Tony Cai, Zheng Tracy Ke, Paxton Turner

Abstract: Motivated by applications in text mining and discrete distribution inference, we investigate the testing for equality of probability mass functions of $K$ groups of high-dimensional multinomial distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null, is proposed. The optimal detection boundary is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyze two real-world datasets to examine variation among consumer reviews of Amazon movies and diversity of statistical paper abstracts.

Submitted 24 November, 2023; v1 submitted 3 January, 2023; originally announced January 2023.

arXiv:2211.12612 [pdf, ps, other]

Transfer Learning for Contextual Multi-armed Bandits

Authors: Changxiao Cai, T. Tony Cai, Hongzhe Li

Abstract: Motivated by a range of applications, we study in this paper the problem of transfer learning for nonparametric contextual multi-armed bandits under the covariate shift model, where we have data collected on source bandits before the start of the target bandit learning. The minimax rate of convergence for the cumulative regret is established and a novel transfer learning algorithm that attains the minimax regret is proposed. The results quantify the contribution of the data from the source domains for learning in the target domain in the context of nonparametric contextual multi-armed bandits. In view of the general impossibility of adaptation to unknown smoothness, we develop a data-driven algorithm that achieves near-optimal statistical guarantees (up to a logarithmic factor) while automatically adapting to the unknown parameters over a large collection of parameter spaces under an additional self-similarity assumption. A simulation study is carried out to illustrate the benefits of utilizing the data from the auxiliary source domains for learning in the target domain.

Submitted 24 January, 2024; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: Accepted to the Annals of Statistics

arXiv:2202.10007 [pdf, other]

Statistical Inference for Genetic Relatedness Based on High-Dimensional Logistic Regression

Authors: Rong Ma, Zijian Guo, T. Tony Cai, Hongzhe Li

Abstract: This paper studies the problem of statistical inference for genetic relatedness between binary traits based on individual-level genome-wide association data. Specifically, under the high-dimensional logistic regression models, we define parameters characterizing the cross-trait genetic correlation, the genetic covariance and the trait-specific genetic variance. A novel weighted debiasing method is developed for the logistic Lasso estimator and computationally efficient debiased estimators are proposed. The rates of convergence for these estimators are studied and their asymptotic normality is established under mild conditions. Moreover, we construct confidence intervals and statistical tests for these parameters, and provide theoretical justifications for the methods, including the coverage probability and expected length of the confidence intervals, as well as the size and power of the proposed tests. Numerical studies are conducted under both model generated data and simulated genetic data to show the superiority of the proposed methods. By analyzing a real data set on autoimmune diseases, we demonstrate its ability to obtain novel insights about the shared genetic architecture between ten pediatric autoimmune diseases.

Submitted 5 October, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

arXiv:2201.06438 [pdf, other]

Matrix Reordering for Noisy Disordered Matrices: Optimality and Computationally Efficient Algorithms

Authors: T. Tony Cai, Rong Ma

Abstract: Motivated by applications in single-cell biology and metagenomics, we investigate the problem of matrix reordering based on a noisy disordered monotone Toeplitz matrix model. We establish the fundamental statistical limit for this problem in a decision-theoretic framework and demonstrate that a constrained least squares estimator achieves the optimal rate. However, due to its computational complexity, we analyze a popular polynomial-time algorithm, spectral seriation, and show that it is suboptimal. To address this, we propose a novel polynomial-time adaptive sorting algorithm with guaranteed performance improvement. Simulations and analyses of two real single-cell RNA sequencing datasets demonstrate the superiority of our algorithm over existing methods.

Submitted 13 August, 2023; v1 submitted 17 January, 2022; originally announced January 2022.

Comments: accepted by IEEE Transactions on Information Theory

arXiv:2201.03727 [pdf, ps, other]

Estimation and Inference with Proxy Data and its Genetic Applications

Authors: Sai Li, T. Tony Cai, Hongzhe Li

Abstract: Existing high-dimensional statistical methods are largely established for analyzing individual-level data. In this work, we study estimation and inference for high-dimensional linear models where we only observe "proxy data", which include the marginal statistics and sample covariance matrix that are computed based on different sets of individuals. We develop a rate optimal method for estimation and inference for the regression coefficient vector and its linear functionals based on the proxy data. Moreover, we show the intrinsic limitations in the proxy-data based inference: the minimax optimal rate for estimation is slower than that in the conventional case where individual data are observed; the power for testing and multiple testing does not go to one as the signal strength goes to infinity. These interesting findings are illustrated through simulation studies and an analysis of a dataset concerning the genetic associations of hindlimb muscle weight in a mouse population.

Submitted 10 January, 2022; originally announced January 2022.

arXiv:2112.09313 [pdf, other]

Federated Adaptive Causal Estimation (FACE) of Target Treatment Effects

Authors: Larry Han, Jue Hou, Kelly Cho, Rui Duan, Tianxi Cai

Abstract: Federated learning of causal estimands may greatly improve estimation efficiency by leveraging data from multiple study sites, but robustness to heterogeneity and model misspecifications is vital for ensuring validity. We develop a Federated Adaptive Causal Estimation (FACE) framework to incorporate heterogeneous data from multiple sites to provide treatment effect estimation and inference for a flexibly specified target population of interest. FACE accounts for site-level heterogeneity in the distribution of covariates through density ratio weighting. To safely incorporate source sites and avoid negative transfer, we introduce an adaptive weighting procedure via a penalized regression, which achieves both consistency and optimal efficiency. Our strategy is communication-efficient and privacy-preserving, allowing participating sites to share summary statistics only once with other sites. We conduct both theoretical and numerical evaluations of FACE and apply it to conduct a comparative effectiveness study of BNT162b2 (Pfizer) and mRNA-1273 (Moderna) vaccines on COVID-19 outcomes in U.S. veterans using electronic health records from five VA regional sites. We show that compared to traditional methods, FACE meaningfully increases the precision of treatment effect estimates, with reductions in standard errors ranging from $26\%$ to $67\%$.

Submitted 5 October, 2023; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: 59 pages

arXiv:2111.02826 [pdf, other]

Finding the Optimal Dynamic Treatment Regime Using Smooth Fisher Consistent Surrogate Loss

Authors: Nilanjana Laha, Aaron Sonabend-W, Rajarshi Mukherjee, Tianxi Cai

Abstract: Large health care data repositories such as electronic health records (EHR) open new opportunities to derive individualized treatment strategies for complicated diseases such as sepsis. In this paper, we consider the problem of estimating sequential treatment rules tailored to a patient's individual characteristics, often referred to as dynamic treatment regimes (DTRs). Our main objective is to find the optimal DTR that maximizes a discontinuous value function through direct maximization of Fisher consistent surrogate loss functions. In this regard, we demonstrate that a large class of concave surrogates fails to be Fisher consistent -- a behavior that differs from the classical binary classification problems. We further characterize a non-concave family of Fisher consistent smooth surrogate functions, which is amenable to gradient-descent type optimization algorithms. Compared to the existing direct search approach under the support vector machine framework (Zhao et al., 2015), our proposed DTR estimation via surrogate loss optimization (DTRESLO) method is more computationally scalable to large sample sizes and allows for broader functional classes for treatment policies. We establish theoretical properties for our proposed DTR estimator and obtain a sharp upper bound on the regret corresponding to our DTRESLO method. The finite sample performance of our proposed estimator is evaluated through extensive simulations. Finally, we illustrate the working principles and benefits of our method for estimating an optimal DTR for treating sepsis using EHR data from sepsis patients admitted to intensive care units.

Submitted 30 September, 2023; v1 submitted 3 November, 2021; originally announced November 2021.

MSC Class: 62G20 ACM Class: G.3

arXiv:2110.12336 [pdf, other]

Efficient and Robust Semi-supervised Estimation of ATE with Partially Annotated Treatment and Response

Authors: Jue Hou, Rajarshi Mukherjee, Tianxi Cai

Abstract: A notable challenge of leveraging Electronic Health Records (EHR) for treatment effect assessment is the lack of precise information on important clinical variables, including the treatment received and the response. Both treatment information and response often cannot be accurately captured by readily available EHR features and require labor intensive manual chart review to precisely annotate, which limits the number of available gold standard labels on these key variables. We consider average treatment effect (ATE) estimation under such a semi-supervised setting with a large number of unlabeled samples containing both confounders and imperfect EHR features for treatment and response. We derive the efficient influence function for ATE and use it to construct a semi-supervised multiple machine learning (SMMAL) estimator. We showcase that our SMMAL estimator is semi-parametric efficient with B-spline regression under low-dimensional smooth models. We develop the adaptive sparsity/model doubly robust estimation under high-dimensional logistic propensity score and outcome regression models. Results from simulation studies support the validity of our SMMAL method and its superiority over supervised benchmarks.

Submitted 23 October, 2021; originally announced October 2021.

arXiv:2107.00179 [pdf]

Distributed Nonparametric Function Estimation: Optimal Rate of Convergence and Cost of Adaptation

Authors: T. Tony Cai, Hongji Wei

Abstract: Distributed minimax estimation and distributed adaptive estimation under communication constraints for Gaussian sequence model and white noise model are studied. The minimax rate of convergence for distributed estimation over a given Besov class, which serves as a benchmark for the cost of adaptation, is established. We then quantify the exact communication cost for adaptation and construct an optimally adaptive procedure for distributed estimation over a range of Besov classes. The results demonstrate significant differences between nonparametric function estimation in the distributed setting and the conventional centralized setting. For global estimation, adaptation in general cannot be achieved for free in the distributed setting. The new technical tools to obtain the exact characterization for the cost of adaptation can be of independent interest.

Submitted 30 June, 2021; originally announced July 2021.

MSC Class: 62F30

arXiv:2105.10360 [pdf, other]

Multi-source Learning via Completion of Block-wise Overlapping Noisy Matrices

Authors: Doudou Zhou, Tianxi Cai, Junwei Lu

Abstract: Matrix completion has attracted attention in many fields, including statistics, applied mathematics, and electrical engineering. Most of the works focus on the independent sampling models under which the observed entries are sampled independently. Motivated by applications in the integration of knowledge graphs derived from multi-source biomedical data such as those from Electronic Health Records (EHR) and biomedical text, we propose the Block-wise Overlapping Noisy Matrix Integration (BONMI) to treat blockwise missingness of symmetric matrices representing relatedness between entity pairs. Our idea is to exploit the orthogonal Procrustes problem to align the eigenspace of the two sub-matrices, then complete the missing blocks by the inner product of the two low-rank components. Besides, we prove the statistical rate for the eigenspace of the underlying matrix, which is comparable to the rate under the independently missing assumption. Simulation studies show that the method performs well under a variety of configurations. In the real data analysis, the method is applied to two tasks: (i) the integration of several point-wise mutual information matrices built from English EHR and Chinese medical text data, and (ii) the machine translation between English and Chinese medical concepts. Our method shows an advantage over existing methods.

Submitted 9 October, 2021; v1 submitted 21 May, 2021; originally announced May 2021.

arXiv:2105.07536 [pdf, other]

Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data

Authors: T. Tony Cai, Rong Ma

Abstract: This paper investigates the theoretical foundations of the t-distributed stochastic neighbor embedding (t-SNE) algorithm, a popular nonlinear dimension reduction and data visualization method. A novel theoretical framework for the analysis of t-SNE based on the gradient descent approach is presented. For the early exaggeration stage of t-SNE, we show its asymptotic equivalence to power iterations based on the underlying graph Laplacian, characterize its limiting behavior, and uncover its deep connection to Laplacian spectral clustering, and fundamental principles including early stopping as implicit regularization. The results explain the intrinsic mechanism and the empirical benefits of such a computational strategy. For the embedding stage of t-SNE, we characterize the kinematics of the low-dimensional map throughout the iterations, and identify an amplification phase, featuring the intercluster repulsion and the expansive behavior of the low-dimensional map, and a stabilization phase. The general theory explains the fast convergence rate and the exceptional empirical performance of t-SNE for visualizing clustered data, brings forth interpretations of the t-SNE visualizations, and provides theoretical guidance for applying t-SNE and selecting its tuning parameters in various applications.

Submitted 31 October, 2022; v1 submitted 16 May, 2021; originally announced May 2021.

Comments: Accepted by Journal of Machine Learning Research

arXiv:2105.01264 [pdf, other]

Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction

Authors: Jue Hou, Zijian Guo, Tianxi Cai

Abstract: Risk modeling with EHR data is challenging due to a lack of direct observations on the disease outcome, and the high dimensionality of the candidate predictors. In this paper, we develop a surrogate assisted semi-supervised-learning (SAS) approach to risk modeling with high dimensional predictors, leveraging a large unlabeled dataset on candidate predictors and surrogates of outcome, as well as a small labeled dataset with annotated outcomes. The SAS procedure borrows information from surrogates along with candidate predictors to impute the unobserved outcomes via a sparse working imputation model with moment conditions to achieve robustness against mis-specification in the imputation model and a one-step bias correction to enable interval estimation for the predicted risk. We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high dimensional working model, even when the underlying risk prediction model is dense and the risk model is mis-specified. We present an extensive simulation study to demonstrate the superiority of our SSL approach compared to existing supervised methods. We apply the method to derive genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.

Submitted 3 May, 2021; originally announced May 2021.

arXiv:2103.12846 [pdf, ps, other]

On the global identifiability of logistic regression models with misclassified outcomes

Authors: Rui Duan, Yang Ning, Jiasheng Shi, Raymond J Carroll, Tianxi Cai, Yong Chen

Abstract: In the last decade, the secondary use of large data from health systems, such as electronic health records, has demonstrated great promise in advancing biomedical discoveries and improving clinical decision making. However, there is an increasing concern about biases in association studies caused by misclassification in the binary outcomes derived from electronic health records. We revisit the classical logistic regression model with misclassified outcomes. Although local identification conditions in some related settings have been previously established, the global identification of such models remains largely unknown and is an important question yet to be answered. We derive necessary and sufficient conditions for global identifiability of logistic regression models with misclassified outcomes, using a novel approach termed the submodel analysis, and a technique adapted from the Picard-Lindelöf existence theorem in ordinary differential equations. In particular, our results are applicable to logistic models with discrete covariates, which is a common situation in biomedical studies. The conditions are easy to verify in practice. In addition to model identifiability, we propose a hypothesis testing procedure for regression coefficients in the misclassified logistic regression model when the model is not identifiable under the null.

Submitted 23 March, 2021; originally announced March 2021.

arXiv:2102.08807 [pdf, other]

The Linearized Hellinger--Kantorovich Distance

Authors: Tianji Cai, Junyi Cheng, Bernhard Schmitzer, Matthew Thorpe

Abstract: In this paper we study the local linearization of the Hellinger--Kantorovich distance via its Riemannian structure. We give explicit expressions for the logarithmic and exponential map and identify a suitable notion of a Riemannian inner product. Samples can thus be represented as vectors in the tangent space of a suitable reference measure where the norm locally approximates the original metric. Working with the local linearization and the corresponding embeddings allows for the advantages of the Euclidean setting, such as faster computations and a plethora of data analysis tools, whilst still enjoying approximately the descriptive power of the Hellinger--Kantorovich metric.

Submitted 24 September, 2021; v1 submitted 17 February, 2021; originally announced February 2021.

arXiv:2011.03900 [pdf, other]

The Cost of Privacy in Generalized Linear Models: Algorithms and Minimax Lower Bounds

Authors: T. Tony Cai, Yichen Wang, Linjun Zhang

Abstract: We propose differentially private algorithms for parameter estimation in both low-dimensional and high-dimensional sparse generalized linear models (GLMs) by constructing private versions of projected gradient descent. We show that the proposed algorithms are nearly rate-optimal by characterizing their statistical performance and establishing privacy-constrained minimax lower bounds for GLMs. The lower bounds are obtained via a novel technique, which is based on Stein's Lemma and generalizes the tracing attack technique for privacy-constrained lower bounds. This lower bound argument can be of independent interest as it is applicable to general parametric models. Simulated and real data experiments are conducted to demonstrate the numerical performance of our algorithms.

Submitted 5 December, 2020; v1 submitted 7 November, 2020; originally announced November 2020.

Comments: 56 pages, 6 figures

arXiv:2009.03294 [pdf, other]

GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training

Authors: Tianle Cai, Shengjie Luo, Keyulu Xu, Di He, Tie-Yan Liu, Liwei Wang

Abstract: Normalization is known to help the optimization of deep neural networks. Curiously, different architectures require specialized normalization methods. In this paper, we study what normalization is effective for Graph Neural Networks (GNNs). First, we adapt and evaluate the existing methods from other domains to GNNs. Faster convergence is achieved with InstanceNorm compared to BatchNorm and LayerNorm. We provide an explanation by showing that InstanceNorm serves as a preconditioner for GNNs, but such preconditioning effect is weaker with BatchNorm due to the heavy batch noise in graph datasets. Second, we show that the shift operation in InstanceNorm results in an expressiveness degradation of GNNs for highly regular graphs. We address this issue by proposing GraphNorm with a learnable shift. Empirically, GNNs with GraphNorm converge faster compared to GNNs using other normalization. GraphNorm also improves the generalization of GNNs, achieving better performance on graph classification benchmarks.

Submitted 11 June, 2021; v1 submitted 7 September, 2020; originally announced September 2020.

Comments: ICML 2021, Code: https://github.com/lsj2408/GraphNorm

arXiv:2008.12434 [pdf, ps, other]

On the Non-Asymptotic Concentration of Heteroskedastic Wishart-type Matrix

Authors: T. Tony Cai, Rungang Han, Anru R. Zhang

Abstract: This paper focuses on the non-asymptotic concentration of the heteroskedastic Wishart-type matrices. Suppose $Z$ is a $p_1$-by-$p_2$ random matrix and $Z_{ij} \sim N(0,σ_{ij}^2)$ independently, we prove the expected spectral norm of Wishart matrix deviations (i.e., $\mathbb{E} \left\|ZZ^\top - \mathbb{E} ZZ^\top\right\|$) is upper bounded by \begin{equation*} \begin{split} (1+ε)\left\{2σ_Cσ_R + σ_C^2 + Cσ_Rσ_*\sqrt{\log(p_1 \wedge p_2)} + Cσ_*^2\log(p_1 \wedge p_2)\right\}, \end{split} \end{equation*} where $σ_C^2 := \max_j \sum_{i=1}^{p_1}σ_{ij}^2$, $σ_R^2 := \max_i \sum_{j=1}^{p_2}σ_{ij}^2$ and $σ_*^2 := \max_{i,j}σ_{ij}^2$. A minimax lower bound is developed that matches this upper bound. Then, we derive the concentration inequalities, moments, and tail bounds for the heteroskedastic Wishart-type matrix under more general distributions, such as sub-Gaussian and heavy-tailed distributions. Next, we consider the cases where $Z$ has homoskedastic columns or rows (i.e., $σ_{ij} \approx σ_i$ or $σ_{ij} \approx σ_j$) and derive the rate-optimal Wishart-type concentration bounds. Finally, we apply the developed tools to identify the sharp signal-to-noise ratio threshold for consistent clustering in the heteroskedastic clustering problem.

Submitted 16 February, 2022; v1 submitted 27 August, 2020; originally announced August 2020.

Comments: Electronic Journal of Probability, to appear

arXiv:2006.02025 [pdf, ps, other]

doi 10.37236/8091

Deformation of Cayley's hyperdeterminants

Authors: Tommy Wuxing Cai, Naihuan Jing

Abstract: We introduce a deformation of Cayley's second hyperdeterminant for even-dimensional hypermatrices. As an application, we formulate a generalization of the Jacobi-Trudi formula for Macdonald functions of rectangular shapes generalizing Matsumoto's formula for Jack functions.

Submitted 5 June, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

Comments: 9 pages, 0 figures

MSC Class: Primary: 05E05; Secondary: 17B69; 05E10

Journal ref: Elec. J. Combin. 27(2) (2020) P2.50

arXiv:2002.07624 [pdf, other]

Optimal Structured Principal Subspace Estimation: Metric Entropy and Minimax Rates

Authors: T. Tony Cai, Hongzhe Li, Rong Ma

Abstract: Driven by a wide range of applications, many principal subspace estimation problems have been studied individually under different structural constraints. This paper presents a unified framework for the statistical analysis of a general structured principal subspace estimation problem which includes as special cases non-negative PCA/SVD, sparse PCA/SVD, subspace constrained PCA/SVD, and spectral clustering. General minimax lower and upper bounds are established to characterize the interplay between the information-geometric complexity of the structural set for the principal subspaces, the signal-to-noise ratio (SNR), and the dimensionality. The results yield interesting phase transition phenomena concerning the rates of convergence as a function of the SNRs and the fundamental limit for consistent estimation. Applying the general results to the specific settings yields the minimax rates of convergence for those problems, including the previously unknown optimal rates for non-negative PCA/SVD, sparse SVD and subspace constrained PCA/SVD.

Submitted 16 November, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

arXiv:2001.08877 [pdf, other]

Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms

Authors: T. Tony Cai, Hongji Wei

Abstract: We study distributed estimation of a Gaussian mean under communication constraints in a decision theoretical framework. Minimax rates of convergence, which characterize the tradeoff between the communication costs and statistical accuracy, are established in both the univariate and multivariate settings. Communication-efficient and statistically optimal procedures are developed. In the univariate case, the optimal rate depends only on the total communication budget, so long as each local machine has at least one bit. However, in the multivariate case, the minimax rate depends on the specific allocations of the communication budgets among the local machines. Although optimal estimation of a Gaussian mean is relatively simple in the conventional setting, it is quite involved under the communication constraints, both in terms of the optimal procedure design and lower bound argument. The techniques developed in this paper can be of independent interest. An essential step is the decomposition of the minimax estimation problem into two stages, localization and refinement. This critical decomposition provides a framework for both the lower bound analysis and optimal procedure design.

Submitted 23 January, 2020; originally announced January 2020.

arXiv:1912.03870 [pdf, ps, other]

doi 10.1088/1742-5468/aba0aa

Correlation functions of charged free boson and fermion systems

Authors: Naihuan Jing, Zhijun Li, Tommy Wuxing Cai

Abstract: Using the idea of the quantum inverse scattering method, we introduce the operators $\mathbf{B}(x), \mathbf{C}(x)$ and $\mathbf{\tilde{B}}(x), \mathbf{\tilde{C}}(x)$ corresponding to the off-diagonal entries of the monodromy matrix $T$ for the phase model and $i$-boson model in terms of bc fermions and neutral fermions respectively, thus giving alternative treatment of the KP and BKP hierarchies. We also introduce analogous operators $\mathbf{B}^{*}(x)$ and $\mathbf{C}^{*}(x)$ for the charged free boson system and show that they are in complete analogy to those of $bc$ fermionic fields. It is proved that the correlation function $\langle 0|\mathbf{C}(x_N)\cdots\mathbf{C}(x_1)\mathbf{B}(y_1)\cdots\mathbf{B}(y_N)|0\rangle$ in the $bc$ fermionic fields is the inverse of the correlation function $\langle 0|\mathbf{C}^{*}(x_N)\cdots\mathbf{C}^{*}(x_1)\mathbf{B}^{*}(y_1)\cdots \mathbf{B}^{*}(y_N)|0\rangle$ in the charged free bosons.

Submitted 16 June, 2020; v1 submitted 9 December, 2019; originally announced December 2019.

Comments: 26 pages. Final version for J. Stat. Mech

MSC Class: Primary: 17B37; Secondary: 58A17; 15A75; 15B33; 15A15; 05E05

Journal ref: J. Stat. Mech. (2020), 083101, 27pp

arXiv:1911.12516 [pdf, other]

doi 10.1093/biomet/asaa082

Optimal Estimation of Bacterial Growth Rates Based on Permuted Monotone Matrix

Authors: Rong Ma, T. Tony Cai, Hongzhe Li

Abstract: Motivated by the problem of estimating the bacterial growth rates for genome assemblies from shotgun metagenomic data, we consider the permuted monotone matrix model $Y=ΘΠ+Z$, where $Y\in \mathbb{R}^{n\times p}$ is observed, $Θ\in \mathbb{R}^{n\times p}$ is an unknown approximately rank-one signal matrix with monotone rows, $Π\in \mathbb{R}^{p\times p}$ is an unknown permutation matrix, and $Z\in \mathbb{R}^{n\times p}$ is the noise matrix. This paper studies the estimation of the extreme values associated to the signal matrix $Θ$, including its first and last columns, as well as their difference. Treating these estimation problems as compound decision problems, minimax rate-optimal estimators are constructed using the spectral column sorting method. Numerical experiments through simulated and synthetic microbiome metagenomic data are presented, showing the superiority of the proposed methods over the alternatives. The methods are illustrated by comparing the growth rates of gut bacteria between inflammatory bowel disease patients and normal controls.

Submitted 26 August, 2020; v1 submitted 27 November, 2019; originally announced November 2019.

Journal ref: Biometrika (2020)

arXiv:1911.11345 [pdf, other]

High Dimensional M-Estimation with Missing Outcomes: A Semi-Parametric Framework

Authors: Abhishek Chakrabortty, Jiarui Lu, T. Tony Cai, Hongzhe Li

Abstract: We consider high dimensional $M$-estimation in settings where the response $Y$ is possibly missing at random and the covariates $\mathbf{X} \in \mathbb{R}^p$ can be high dimensional compared to the sample size $n$. The parameter of interest $\boldsymbol{\theta}_0 \in \mathbb{R}^d$ is defined as the minimizer of the risk of a convex loss, under a fully non-parametric model, and $\boldsymbol{\theta}_0$ itself is high dimensional which is a key distinction from existing works. Standard high dimensional regression and series estimation with possibly misspecified models and missing $Y$ are included as special cases, as well as their counterparts in causal inference using 'potential outcomes'. Assuming $\boldsymbol{\theta}_0$ is $s$-sparse ($s \ll n$), we propose an $L_1$-regularized debiased and doubly robust (DDR) estimator of $\boldsymbol{\theta}_0$ based on a high dimensional adaptation of the traditional double robust (DR) estimator's construction. Under mild tail assumptions and arbitrarily chosen (working) models for the propensity score (PS) and the outcome regression (OR) estimators, satisfying only some high-level conditions, we establish finite sample performance bounds for the DDR estimator showing its (optimal) $L_2$ error rate to be $\sqrt{s (\log d)/ n}$ when both models are correct, and its consistency and DR properties when only one of them is correct. Further, when both the models are correct, we propose a desparsified version of our DDR estimator that satisfies an asymptotic linear expansion and facilitates inference on low dimensional components of $\boldsymbol{\theta}_0$. Finally, we discuss various choices of high dimensional parametric/semi-parametric working models for the PS and OR estimators. All results are validated via detailed simulations.

Submitted 26 November, 2019; originally announced November 2019.

Comments: 34 pages, 4 tables (Supplement: 58 pages, 10 tables)

arXiv:1911.10604 [pdf, other]

doi 10.1080/01621459.2020.1713794

Optimal Permutation Recovery in Permuted Monotone Matrix Model

Authors: Rong Ma, T. Tony Cai, Hongzhe Li

Abstract: Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model $Y=ΘΠ+Z$, where the rows represent different samples, the columns represent contigs in genome assemblies and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, $Θ$ is an unknown mean matrix with monotone entries for each row, $Π$ is a permutation matrix that permutes the columns of $Θ$, and $Z$ is a noise matrix. This paper studies the problem of estimation/recovery of $Π$ given the observed noisy matrix $Y$. We propose an estimator based on the best linear projection, which is shown to be minimax rate-optimal for both exact recovery, as measured by the 0-1 loss, and partial recovery, as quantified by the normalized Kendall's tau distance. Simulation studies demonstrate the superior empirical performance of the proposed estimator over alternative methods. We demonstrate the methods using a synthetic metagenomics dataset of 45 closely related bacterial species and a real metagenomic dataset to compare the bacterial growth dynamics between the responders and the non-responders of the IBD patients after 8 weeks of treatment.

Submitted 13 July, 2020; v1 submitted 24 November, 2019; originally announced November 2019.

Journal ref: Journal of the American Statistical Association, 2020
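
The two loss functions named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, permutations are assumed to be given as sequences of ranks, and the Kendall's tau distance is assumed to be normalized by the number of pairs $p(p-1)/2$, which may differ from the normalization used in the paper.

```python
from itertools import combinations


def zero_one_loss(perm_true, perm_est):
    """0-1 loss for exact recovery: 0 if the two permutations agree, else 1."""
    return 0 if list(perm_true) == list(perm_est) else 1


def normalized_kendall_tau(perm_true, perm_est):
    """Fraction of index pairs that the two permutations order differently.

    Assumption: normalization by the total number of pairs p*(p-1)/2.
    """
    p = len(perm_true)
    if p != len(perm_est):
        raise ValueError("permutations must have the same length")
    if p < 2:
        return 0.0
    discordant = sum(
        1
        for i, j in combinations(range(p), 2)
        if (perm_true[i] - perm_true[j]) * (perm_est[i] - perm_est[j]) < 0
    )
    return discordant / (p * (p - 1) // 2)


if __name__ == "__main__":
    truth = [0, 1, 2, 3]
    estimate = [1, 0, 2, 3]  # one adjacent pair swapped
    print(zero_one_loss(truth, estimate))
    print(normalized_kendall_tau(truth, estimate))
```

The 0-1 loss only registers whether recovery is exact, while the normalized distance grades partial recovery by counting pairwise order disagreements.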

arXiv:1911.08176 [pdf, ps, other]

A Generalization of A Result of Gauss on Primitive Root

Authors: Hao Zhong, Tianxin Cai

Abstract: A primitive root modulo an integer $n$ is a generator of the multiplicative group of integers modulo $n$. Gauss proved that for any prime number $p$ greater than $3$, the sum of its primitive roots is congruent to $μ(p-1)$ modulo $p$ while their product is congruent to $1$ modulo $p$, where $μ$ is the Möbius function. In this paper, we will generalize these two interesting congruences and give the congruences of the sum and the product of integers with the same index modulo $n$.

Submitted 19 November, 2019; originally announced November 2019.

Comments: 9 pages
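
The classical congruences for primitive roots modulo a prime, mentioned in the abstract above, can be checked numerically for small primes. The following Python sketch is an illustration only and is not taken from the paper; all function names are hypothetical. It computes the sum and the product of the primitive roots modulo $p$, together with $μ(p-1)$ reduced modulo $p$, so the three values can be compared directly.

```python
def prime_factors(n):
    """Return the set of distinct prime factors of n (trial division)."""
    factors = set()
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.add(d)
            n //= d
        d += 1
    if n > 1:
        factors.add(n)
    return factors


def mobius(n):
    """Möbius function: 0 if n has a squared prime factor, else (-1)^k."""
    result = 1
    d = 2
    while d * d <= n:
        if n % d == 0:
            n //= d
            if n % d == 0:
                return 0  # d^2 divides the original n
            result = -result
        d += 1
    if n > 1:
        result = -result
    return result


def primitive_roots(p):
    """List all primitive roots modulo a prime p."""
    qs = prime_factors(p - 1)
    # g is a primitive root iff g^((p-1)/q) != 1 (mod p) for every prime q | p-1.
    return [g for g in range(1, p)
            if all(pow(g, (p - 1) // q, p) != 1 for q in qs)]


def check(p):
    """Return (sum of roots mod p, product of roots mod p, mu(p-1) mod p)."""
    roots = primitive_roots(p)
    total = sum(roots) % p
    product = 1
    for g in roots:
        product = product * g % p
    return total, product, mobius(p - 1) % p


if __name__ == "__main__":
    for p in (5, 7, 11, 13, 31):
        print(p, primitive_roots(p), check(p))
```

Trial division is adequate here because the check is meant for small primes only.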

arXiv:1909.09851 [pdf, other]

Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference

Authors: T. Tony Cai, Anru R. Zhang, Yuchen Zhou

Abstract: We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model -- an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.

Submitted 6 May, 2022; v1 submitted 21 September, 2019; originally announced September 2019.

Comments: IEEE Transactions on Information Theory, to appear

arXiv:1908.05598 [pdf, ps, other]

On the divisor problem with congruence conditions

Authors: Lirui Jia, Wenguang Zhai, Tianxin Cai

Abstract: Let $d(n; r_1, q_1, r_2, q_2)$ be the number of factorizations $n=n_1n_2$ satisfying $n_i\equiv r_i\pmod{q_i}$ ($i=1,2$) and $Δ(x; r_1, q_1, r_2, q_2)$ be the error term of the summatory function of $d(n; r_1, q_1, r_2, q_2)$ with $x\geq (q_1q_2)^{1+\varepsilon}, 1\leq r_i\leq q_i$, and $(r_i, q_i)=1$ ($i=1, 2$). We study the power moments and sign changes of $Δ(x; r_1, q_1, r_2, q_2)$, and prove that for a sufficiently large constant $C$, $Δ(q_1q_2x; r_1, q_1, r_2, q_2)$ changes sign in the interval $[T,T+C\sqrt{T}]$ for any large $T$. Meanwhile, we show that for a small constant $c'$, there exist infinitely many subintervals of length $c'\sqrt{T}\log^{-7}T$ in $[T,2T]$ where $\pm Δ(q_1q_2x; r_1, q_1, r_2, q_2)> c_5x^\frac{1}{4}$ always holds.

Submitted 14 August, 2019; originally announced August 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1603.04977
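
The counting function $d(n; r_1, q_1, r_2, q_2)$ defined in the abstract above can be evaluated by brute force for small $n$. The Python sketch below is illustrative only, with hypothetical function names; it counts ordered factorizations and accumulates the summatory function. The main term needed to form the error term $Δ$ is not stated in the abstract, so it is not computed here.

```python
def d_congruent(n, r1, q1, r2, q2):
    """Count ordered factorizations n = n1 * n2 with
    n1 ≡ r1 (mod q1) and n2 ≡ r2 (mod q2)."""
    count = 0
    for n1 in range(1, n + 1):
        if n % n1 == 0:
            n2 = n // n1
            if (n1 - r1) % q1 == 0 and (n2 - r2) % q2 == 0:
                count += 1
    return count


def summatory(x, r1, q1, r2, q2):
    """Sum of d(n; r1, q1, r2, q2) over 1 <= n <= x."""
    return sum(d_congruent(n, r1, q1, r2, q2) for n in range(1, int(x) + 1))


if __name__ == "__main__":
    # With q1 = q2 = 1 the congruence conditions are vacuous, so this is the
    # ordinary divisor function.
    print(d_congruent(12, 1, 1, 1, 1))
    # Both factors required to be ≡ 1 (mod 4).
    print(d_congruent(21, 1, 4, 1, 4))
    print(summatory(10, 1, 1, 1, 1))
```

The brute-force loop costs $O(n)$ per value, which is fine for experimentation but not for the large ranges studied in the paper.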

arXiv:1906.02903 [pdf, other]

Transfer Learning for Nonparametric Classification: Minimax Rate and Adaptive Classifier

Authors: T. Tony Cai, Hongji Wei

Abstract: Human learners have the natural ability to use knowledge gained in one setting for learning in a different but related setting. This ability to transfer knowledge from one task to another is essential for effective learning. In this paper, we study transfer learning in the context of nonparametric classification based on observations from different distributions under the posterior drift model, which is a general framework and arises in many practical problems. We first establish the minimax rate of convergence and construct a rate-optimal two-sample weighted $K$-NN classifier. The results characterize precisely the contribution of the observations from the source distribution to the classification task under the target distribution. A data-driven adaptive classifier is then proposed and is shown to simultaneously attain within a logarithmic factor of the optimal rate over a large collection of parameter spaces. Simulation studies and real data applications are carried out where the numerical results further illustrate the theoretical analysis. Extensions to the case of multiple source distributions are also considered.

Submitted 7 June, 2019; originally announced June 2019.

arXiv:1905.11675 [pdf, ps, other]

Gram-Gauss-Newton Method: Learning Overparameterized Neural Networks for Regression Problems

Authors: Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, Liwei Wang

Abstract: First-order methods such as stochastic gradient descent (SGD) are currently the standard algorithm for training deep neural networks. Second-order methods, despite their better convergence rate, are rarely used in practice due to the prohibitive computational cost in calculating the second-order information. In this paper, we propose a novel Gram-Gauss-Newton (GGN) algorithm to train deep neural networks for regression problems with square loss. Our method draws inspiration from the connection between neural network optimization and kernel regression of neural tangent kernel (NTK). Different from typical second-order methods that have heavy computational cost in each iteration, GGN only has minor overhead compared to first-order methods such as SGD. We also give theoretical results to show that for sufficiently wide neural networks, the convergence rate of GGN is quadratic. Furthermore, we provide convergence guarantee for mini-batch GGN algorithm, which is, to our knowledge, the first convergence result for the mini-batch version of a second-order method on overparameterized neural networks. Preliminary experiments on regression tasks demonstrate that for training standard networks, our GGN algorithm converges much faster and achieves better performance than SGD.

Submitted 25 September, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1905.08757 [pdf, other]

Asymptotic Analysis for Extreme Eigenvalues of Principal Minors of Random Matrices

Authors: T. Tony Cai, Tiefeng Jiang, Xiaoou Li

Abstract: Consider a standard white Wishart matrix with parameters $n$ and $p$. Motivated by applications in high-dimensional statistics and signal processing, we perform asymptotic analysis on the maxima and minima of the eigenvalues of all the $m \times m$ principal minors, under the asymptotic regime that $n,p,m$ go to infinity. Asymptotic results concerning extreme eigenvalues of principal minors of real Wigner matrices are also obtained. In addition, we discuss an application of the theoretical results to the construction of compressed sensing matrices, which provides insights to compressed sensing in signal processing and high dimensional linear regression in statistics.

Submitted 21 May, 2019; originally announced May 2019.

arXiv:1904.12891 [pdf, other]

Optimal Statistical Inference for Individualized Treatment Effects in High-dimensional Models

Authors: Tianxi Cai, Tony Cai, Zijian Guo

Abstract: The ability to predict individualized treatment effects (ITEs) based on a given patient's profile is essential for personalized medicine. We propose a hypothesis testing approach to choosing between two potential treatments for a given individual in the framework of high-dimensional linear models. The methodological novelty lies in the construction of a debiased estimator of the ITE and establishment of its asymptotic normality uniformly for an arbitrary future high-dimensional observation, while the existing methods can only handle certain specific forms of observations. We introduce a testing procedure with the type-I error controlled and establish its asymptotic power. The proposed method can be extended to making inference for general linear contrasts, including both the average treatment effect and outcome prediction. We introduce the optimality framework for hypothesis testing from both the minimaxity and adaptivity perspectives and establish the optimality of the proposed procedure. An extension to high-dimensional approximate linear models is also considered. The finite sample performance of the procedure is demonstrated in simulation studies and further illustrated through an analysis of electronic health records data from patients with rheumatoid arthritis.

Submitted 7 August, 2020; v1 submitted 29 April, 2019; originally announced April 2019.

arXiv:1810.08316 [pdf, other]

Heteroskedastic PCA: Algorithm, Optimality, and Applications

Authors: Anru R. Zhang, T. Tony Cai, Yihong Wu

Abstract: A general framework for principal component analysis (PCA) in the presence of heteroskedastic noise is introduced. We propose an algorithm called HeteroPCA, which involves iteratively imputing the diagonal entries of the sample covariance matrix to remove estimation bias due to heteroskedasticity. This procedure is computationally efficient and provably optimal under the generalized spiked covariance model. A key technical step is a deterministic robust perturbation analysis on singular subspaces, which can be of independent interest. The effectiveness of the proposed algorithm is demonstrated in a suite of problems in high-dimensional statistics, including singular value decomposition (SVD) under heteroskedastic noise, Poisson PCA, and SVD for heteroskedastic and incomplete data.

Submitted 1 April, 2021; v1 submitted 18 October, 2018; originally announced October 2018.

Showing 1–50 of 143 results for author: Cai, T