-
Functional Tensor Regression
Authors:
Tongyu Li,
Fang Yao,
Anru R. Zhang
Abstract:
Tensor regression has attracted significant attention in statistical research. This study tackles the challenge of handling covariates with smoothly varying structures. We introduce a novel framework, termed functional tensor regression, which incorporates both the tensor and functional aspects of the covariate. To address the high dimensionality and functional continuity of the regression coefficient, we employ a low Tucker rank decomposition along with smooth regularization for the functional mode. We develop a functional Riemannian Gauss-Newton algorithm that demonstrates a provable quadratic convergence rate, while the estimation error bound is based on the tensor covariate dimension. Simulations and a neuroimaging analysis illustrate the finite sample performance of the proposed method.
Submitted 10 June, 2025;
originally announced June 2025.
-
Integrated Analysis for Electronic Health Records with Structured and Sporadic Missingness
Authors:
Jianbin Tan,
Yan Zhang,
Chuan Hong,
T. Tony Cai,
Tianxi Cai,
Anru R. Zhang
Abstract:
Objectives: We propose a novel imputation method tailored for Electronic Health Records (EHRs) with structured and sporadic missingness. Such missingness frequently arises in the integration of heterogeneous EHR datasets for downstream clinical applications. By addressing these gaps, our method provides a practical solution for integrated analysis, enhancing data utility and advancing the understanding of population health.
Materials and Methods: We begin by demonstrating structured and sporadic missing mechanisms in the integrated analysis of EHR data. Following this, we introduce a novel imputation framework, Macomss, specifically designed to handle structurally and heterogeneously occurring missing data. We establish theoretical guarantees for Macomss, ensuring its robustness in preserving the integrity and reliability of integrated analyses. To assess its empirical performance, we conduct extensive simulation studies that replicate the complex missingness patterns observed in real-world EHR systems, complemented by validation using EHR datasets from the Duke University Health System (DUHS).
Results: Simulation studies show that our approach consistently outperforms existing imputation methods. Using datasets from three hospitals within DUHS, Macomss achieves the lowest imputation errors for missing data in most cases and provides superior or comparable downstream prediction performance compared to benchmark methods.
Conclusions: We provide a theoretically guaranteed and practically meaningful method for imputing structured and sporadic missing data, enabling accurate and reliable integrated analysis across multiple EHR datasets. The proposed approach holds significant potential for advancing research in population health.
Submitted 10 June, 2025;
originally announced June 2025.
-
Revisit CP Tensor Decomposition: Statistical Optimality and Fast Convergence
Authors:
Runshi Tang,
Julien Chhor,
Olga Klopp,
Anru R. Zhang
Abstract:
Canonical Polyadic (CP) tensor decomposition is a fundamental technique for analyzing high-dimensional tensor data. While the Alternating Least Squares (ALS) algorithm is widely used for computing CP decomposition due to its simplicity and empirical success, its theoretical foundation, particularly regarding statistical optimality and convergence behavior, remains underdeveloped, especially in noisy, non-orthogonal, and higher-rank settings.
In this work, we revisit CP tensor decomposition from a statistical perspective and provide a comprehensive theoretical analysis of ALS under a signal-plus-noise model. We establish non-asymptotic, minimax-optimal error bounds for tensors of general order, dimensions, and rank, assuming suitable initialization. To enable such initialization, we propose Tucker-based Approximation with Simultaneous Diagonalization (TASD), a robust method that improves stability and accuracy in noisy regimes. Combined with ALS, TASD yields a statistically consistent estimator. We further analyze the convergence dynamics of ALS, identifying a two-phase pattern: initial quadratic convergence followed by linear refinement. We also show that in the rank-one setting, ALS with an appropriately chosen initialization attains optimal error within just one or two iterations.
Submitted 28 May, 2025;
originally announced May 2025.
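As a rough illustration of the rank-one remark in the abstract above, the following is a minimal sketch of one ALS sweep on a noiseless rank-one order-3 tensor. It is not the authors' implementation and does not include TASD; the helper names (`normalize`, `outer3`, `als_sweep`, `abs_cosine`), the toy factors, and the starting vectors are all assumptions made for this example. In the rank-one case, the least-squares update for one factor is proportional to the tensor contracted with the other two factors, so without noise a single sweep already recovers the factor directions whenever the starting vectors are not orthogonal to the true ones.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def outer3(a, b, c, scale=1.0):
    """Rank-one order-3 tensor scale * (a outer b outer c) as nested lists."""
    return [[[scale * x * y * z for z in c] for y in b] for x in a]

def als_sweep(T, b, c):
    """One ALS sweep for a rank-one fit: update a, then b, then c.

    Each update contracts T with the two factors held fixed and renormalizes.
    """
    I, J, K = len(T), len(T[0]), len(T[0][0])
    a = normalize([sum(T[i][j][k] * b[j] * c[k] for j in range(J) for k in range(K))
                   for i in range(I)])
    b = normalize([sum(T[i][j][k] * a[i] * c[k] for i in range(I) for k in range(K))
                   for j in range(J)])
    c = normalize([sum(T[i][j][k] * a[i] * b[j] for i in range(I) for j in range(J))
                   for k in range(K)])
    return a, b, c

def abs_cosine(u, v):
    """Absolute cosine of the angle between two vectors (sign-invariant)."""
    return abs(sum(x * y for x, y in zip(normalize(u), normalize(v))))

# Hypothetical ground-truth factors for a noiseless 3 x 3 x 2 toy tensor.
a_true, b_true, c_true = [1.0, 2.0, 3.0], [1.0, -1.0, 2.0], [2.0, 1.0]
T = outer3(a_true, b_true, c_true, scale=5.0)

# Arbitrary starting vectors that are not orthogonal to the true factors.
a_hat, b_hat, c_hat = als_sweep(T, [1.0, 0.0, 0.0], [1.0, 1.0])
print(abs_cosine(a_hat, a_true), abs_cosine(b_hat, b_true), abs_cosine(c_hat, c_true))
```

With noise added to `T`, the same sweep would only approximate the factors, and the quality of the starting vectors would matter; that regime is what the abstract's error bounds address and is not reproduced here.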
-
Subtype-Aware Registration of Longitudinal Electronic Health Records
Authors:
Xin Gai,
Shiyi Jiang,
Anru R. Zhang
Abstract:
Electronic Health Records (EHRs) contain extensive patient information that can inform downstream clinical decisions, such as mortality prediction, disease phenotyping, and disease onset prediction. A key challenge in EHR data analysis is the temporal gap between when a condition is first recorded and its actual onset time. Such timeline misalignment can lead to artificially distinct biomarker trends among patients with similar disease progression, undermining the reliability of downstream analysis and complicating tasks like disease subtyping. To address this challenge, we provide a subtype-aware timeline registration method that leverages data projection and discrete optimization to simultaneously correct timeline misalignment and improve disease subtyping. Through simulation and real-world data analyses, we demonstrate that the proposed method effectively aligns distorted observed records with the true disease progression patterns, enhancing subtyping clarity and improving performance in downstream clinical analyses.
Submitted 13 January, 2025;
originally announced January 2025.
-
High-order Accurate Inference on Manifolds
Authors:
Chengzhu Huang,
Anru R. Zhang
Abstract:
We present a new framework for statistical inference on Riemannian manifolds that achieves high-order accuracy, addressing the challenges posed by non-Euclidean parameter spaces frequently encountered in modern data science. Our approach leverages a novel and computationally efficient procedure to reach higher-order asymptotic precision. In particular, we develop a bootstrap algorithm on Riemannian manifolds that is both computationally efficient and accurate for hypothesis testing and confidence region construction. Although locational hypothesis testing can be reformulated as a standard Euclidean problem, constructing high-order accurate confidence regions necessitates careful treatment of manifold geometry. To this end, we establish high-order asymptotics under a fixed normal chart centered at the true parameter, thereby enabling precise expansions that incorporate curvature effects. We demonstrate the versatility of this framework across various manifold settings (including spheres, the Stiefel manifold, fixed-rank matrix manifolds, and rank-one tensor manifolds) and, for Euclidean submanifolds, introduce a class of projection-like coordinate charts with strong consistency properties. Finally, numerical studies confirm the practical merits of the proposed procedure.
Submitted 26 January, 2025; v1 submitted 11 January, 2025;
originally announced January 2025.
-
Federated PCA and Estimation for Spiked Covariance Matrices: Optimal Rates and Efficient Algorithm
Authors:
Jingyang Li,
T. Tony Cai,
Dong Xia,
Anru R. Zhang
Abstract:
Federated Learning (FL) has gained significant recent attention in machine learning for its enhanced privacy and data security, making it indispensable in fields such as healthcare, finance, and personalized services. This paper investigates federated PCA and estimation for spiked covariance matrices under distributed differential privacy constraints. We establish minimax rates of convergence, with a key finding that the central server's optimal rate is the harmonic mean of the local clients' minimax rates. This guarantees consistent estimation at the central server as long as at least one local client provides consistent results. Notably, consistency is maintained even if some local estimators are inconsistent, provided there are enough clients. These findings highlight the robustness and scalability of FL for reliable statistical inference under privacy constraints. To establish minimax lower bounds, we derive a matrix version of van Trees' inequality, which is of independent interest. Furthermore, we propose an efficient algorithm that preserves differential privacy while achieving near-optimal rates at the central server, up to a logarithmic factor. We address significant technical challenges in analyzing this algorithm, which involves a three-layer spectral decomposition. Numerical performance of the proposed algorithm is investigated using both simulated and real data.
Submitted 23 November, 2024;
originally announced November 2024.
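To make the harmonic-mean statement in the abstract above more concrete, here is a small numerical sketch. It uses the textbook harmonic mean m / sum(1 / r_i); the exact form and constants of the paper's rate are not given in the abstract, so this should be read as an illustration of the general behaviour only. The function name `harmonic_mean` and the local rates are hypothetical. The point it shows is that the harmonic mean is bounded by m times the smallest rate, so a single accurate client is enough to drive the combined value down.

```python
def harmonic_mean(rates):
    """Textbook harmonic mean m / sum(1 / r_i) of positive rates."""
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("rates must be a non-empty list of positive numbers")
    return len(rates) / sum(1.0 / r for r in rates)

# Hypothetical local error rates: one accurate client and three poor ones.
local_rates = [0.01, 1.0, 1.0, 1.0]

combined = harmonic_mean(local_rates)
# The harmonic mean never exceeds m * min(rates), because the sum of
# reciprocals is at least the largest single reciprocal.
bound = len(local_rates) * min(local_rates)

print(combined, bound)
```

The arithmetic mean of the same rates stays above 0.75, which is why the choice of harmonic rather than arithmetic averaging matters for the consistency claim.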
-
Tensor Decomposition with Unaligned Observations
Authors:
Runshi Tang,
Tamara Kolda,
Anru R. Zhang
Abstract:
This paper presents a canonical polyadic (CP) tensor decomposition that addresses unaligned observations. The mode with unaligned observations is represented using functions in a reproducing kernel Hilbert space (RKHS). We introduce a versatile loss function that effectively accounts for various types of data, including binary, integer-valued, and positive-valued types. Additionally, we propose an optimization algorithm for computing tensor decompositions with unaligned observations, along with a stochastic gradient method to enhance computational efficiency. A sketching algorithm is also introduced to further improve efficiency when using the $\ell_2$ loss function. To demonstrate the efficacy of our methods, we provide illustrative examples using both synthetic data and an early childhood human microbiome dataset.
Submitted 17 October, 2024;
originally announced October 2024.
-
Functional Singular Value Decomposition
Authors:
Jianbin Tan,
Pixu Shi,
Anru R. Zhang
Abstract:
Heterogeneous functional data commonly arise in time series and longitudinal studies. To uncover the statistical structures of such data, we propose Functional Singular Value Decomposition (FSVD), a unified framework encompassing various tasks for the analysis of functional data with potential heterogeneity. We establish the mathematical foundation of FSVD by proving its existence and providing its fundamental properties. We then develop an implementation approach for noisy and irregularly observed functional data based on a novel alternating minimization scheme and provide theoretical guarantees for its convergence and estimation accuracy. The FSVD framework also introduces the concepts of intrinsic basis functions and intrinsic basis vectors, representing two fundamental structural aspects of random functions. These concepts enable FSVD to provide new and improved solutions to tasks including functional principal component analysis, factor models, functional clustering, functional linear regression, and functional completion, while effectively handling heterogeneity and irregular temporal sampling. Through extensive simulations, we demonstrate that FSVD-based methods consistently outperform existing methods across these tasks. To showcase the value of FSVD in real-world datasets, we apply it to extract temporal patterns from a COVID-19 case count dataset and perform data completion on an electronic health record dataset.
Submitted 16 February, 2025; v1 submitted 4 October, 2024;
originally announced October 2024.
-
Functional Post-Clustering Selective Inference with Applications to EHR Data Analysis
Authors:
Zihan Zhu,
Xin Gai,
Anru R. Zhang
Abstract:
In electronic health records (EHR) analysis, clustering patients according to patterns in their data is crucial for uncovering new subtypes of diseases. Existing medical literature often relies on classical hypothesis testing methods to test for differences in means between these clusters. Due to selection bias induced by clustering algorithms, the implementation of these classical methods on post-clustering data often leads to an inflated type-I error. In this paper, we introduce a new statistical approach that adjusts for this bias when analyzing data collected over time. Our method extends classical selective inference methods for cross-sectional data to longitudinal data. We provide theoretical guarantees for our approach with upper bounds on the selective type-I and type-II errors. We apply the method to simulated data and real-world Acute Kidney Injury (AKI) EHR datasets, thereby illustrating the advantages of our approach.
Submitted 5 May, 2024;
originally announced May 2024.
-
Blessing of dimension in Bayesian inference on covariance matrices
Authors:
Shounak Chattopadhyay,
Anru R. Zhang,
David B. Dunson
Abstract:
Bayesian factor analysis is routinely used for dimensionality reduction in modeling of high-dimensional covariance matrices. Factor analytic decompositions express the covariance as a sum of a low rank and diagonal matrix. In practice, Gibbs sampling algorithms are typically used for posterior computation, alternating between updating the latent factors, loadings, and residual variances. In this article, we exploit a blessing of dimensionality to develop a provably accurate pseudo-posterior for the covariance matrix that bypasses the need for Gibbs or other variants of Markov chain Monte Carlo sampling. Our proposed Factor Analysis with BLEssing of dimensionality (FABLE) approach relies on a first-stage singular value decomposition (SVD) to estimate the latent factors, and then defines a jointly conjugate prior for the loadings and residual variances. The accuracy of the resulting pseudo-posterior for the covariance improves with increasing dimensionality. We show that FABLE has excellent performance in high-dimensional covariance matrix estimation, including producing well calibrated credible intervals, both theoretically and through simulation experiments. We also demonstrate the strength of our approach in terms of accurate inference and computational efficiency by applying it to a gene expression data set.
Submitted 4 April, 2024;
originally announced April 2024.
-
Soft Phenotyping for Sepsis via EHR Time-aware Soft Clustering
Authors:
Shiyi Jiang,
Xin Gai,
Miriam Treggiari,
William W. Stead,
Yuankang Zhao,
C. David Page,
Anru R. Zhang
Abstract:
Objective: Sepsis is one of the most serious hospital conditions associated with high mortality. Sepsis is the result of a dysregulated immune response to infection that can lead to multiple organ dysfunction and death. Due to the wide variability in the causes of sepsis, clinical presentation, and the recovery trajectories, identifying sepsis sub-phenotypes is crucial to advance our understanding of sepsis characterization, to choose targeted treatments and optimal timing of interventions, and to improve prognostication. Prior studies have described different sub-phenotypes of sepsis using organ-specific characteristics. These studies applied clustering algorithms to electronic health records (EHRs) to identify disease sub-phenotypes. However, prior approaches did not capture temporal information and made uncertain assumptions about the relationships among the sub-phenotypes for clustering procedures.
Methods: We developed a time-aware soft clustering algorithm guided by clinical variables to identify sepsis sub-phenotypes using data available in the EHR.
Results: We identified six novel sepsis hybrid sub-phenotypes and evaluated them for medical plausibility. In addition, we built an early-warning sepsis prediction model using logistic regression.
Conclusion: Our results suggest that these novel sepsis hybrid sub-phenotypes show promise for providing more accurate information on sepsis-related organ dysfunction and sepsis recovery trajectories, which can be important for informing management decisions and sepsis prognosis.
Submitted 5 May, 2024; v1 submitted 14 November, 2023;
originally announced November 2023.
-
Cocaine Use Prediction with Tensor-based Machine Learning on Multimodal MRI Connectome Data
Authors:
Anru R. Zhang,
Ryan P. Bell,
Chen An,
Runshi Tang,
Shana A. Hall,
Cliburn Chan,
Kareem Al-Khalil,
Christina S. Meade
Abstract:
This paper considers the use of machine learning algorithms for predicting cocaine use based on magnetic resonance imaging (MRI) connectomic data. The study utilized functional MRI (fMRI) and diffusion MRI (dMRI) data collected from 275 individuals, which was then parcellated into 246 regions of interest (ROIs) using the Brainnetome atlas. After data preprocessing, the datasets were transformed into tensor form. We developed a tensor-based unsupervised machine learning algorithm to reduce the size of the data tensor from $275$ (individuals) $\times 2$ (fMRI and dMRI) $\times 246$ (ROIs) $\times 246$ (ROIs) to $275$ (individuals) $\times 2$ (fMRI and dMRI) $\times 6$ (clusters) $\times 6$ (clusters). This was achieved by applying the high-order Lloyd algorithm to group the ROI data into 6 clusters. Features were extracted from the reduced tensor and combined with demographic features (age, gender, race, and HIV status). The resulting dataset was used to train a Catboost model using subsampling and nested cross-validation techniques, which achieved a prediction accuracy of 0.857 for identifying cocaine users. The model was also compared with other models, and the feature importance of the model was presented.
Overall, this study highlights the potential for using tensor-based machine learning algorithms to predict cocaine use based on MRI connectomic data and presents a promising approach for identifying individuals at risk of substance abuse.
Submitted 21 October, 2023;
originally announced October 2023.
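The size reduction described in the abstract above (246 x 246 ROI-by-ROI slices reduced to 6 x 6 cluster-by-cluster slices) can be pictured with the following minimal sketch. It assumes the cluster labels are already available and that each reduced entry is the average over the corresponding block of ROI pairs; the abstract does not state how the reduced entries are formed, and the high-order Lloyd algorithm that produces the labels is not implemented here. The function name `block_average` and the toy matrix are hypothetical.

```python
def block_average(matrix, labels, n_clusters):
    """Average matrix[i][j] over all (i, j) with labels[i] == a and labels[j] == b."""
    sums = [[0.0] * n_clusters for _ in range(n_clusters)]
    counts = [[0] * n_clusters for _ in range(n_clusters)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            a, b = labels[i], labels[j]
            sums[a][b] += value
            counts[a][b] += 1
    # Empty blocks (a cluster with no members) are reported as 0.0.
    return [[sums[a][b] / counts[a][b] if counts[a][b] else 0.0
             for b in range(n_clusters)]
            for a in range(n_clusters)]

# Toy 4 x 4 "connectivity" matrix whose four regions fall into two clusters.
toy = [[1.0, 1.0, 0.0, 0.0],
       [1.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 2.0, 2.0],
       [0.0, 0.0, 2.0, 2.0]]
toy_labels = [0, 0, 1, 1]  # assumed given; not computed by high-order Lloyd

reduced = block_average(toy, toy_labels, 2)
print(reduced)

# Entry counts for the tensor sizes quoted in the abstract.
full_entries = 275 * 2 * 246 * 246
reduced_entries = 275 * 2 * 6 * 6
print(full_entries, reduced_entries)
```

Applying such a reduction to every individual and modality would turn each 246 x 246 slice into a 6 x 6 slice, which matches the tensor shapes quoted in the abstract.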
-
Mode-wise Principal Subspace Pursuit and Matrix Spiked Covariance Model
Authors:
Runshi Tang,
Ming Yuan,
Anru R. Zhang
Abstract:
This paper introduces a novel framework called Mode-wise Principal Subspace Pursuit (MOP-UP) to extract hidden variations in both the row and column dimensions for matrix data. To enhance the understanding of the framework, we introduce a class of matrix-variate spiked covariance models that serve as inspiration for the development of the MOP-UP algorithm. The MOP-UP algorithm consists of two steps: Average Subspace Capture (ASC) and Alternating Projection (AP). These steps are specifically designed to capture the row-wise and column-wise dimension-reduced subspaces which contain the most informative features of the data. ASC utilizes a novel average projection operator as initialization and achieves exact recovery in the noiseless setting. We analyze the convergence and non-asymptotic error bounds of MOP-UP, introducing a blockwise matrix eigenvalue perturbation bound that proves the desired bound, where classic perturbation bounds fail. The effectiveness and practical merits of the proposed framework are demonstrated through experiments on both simulated and real datasets. Lastly, we discuss generalizations of our approach to higher-order data.
Submitted 4 August, 2024; v1 submitted 2 July, 2023;
originally announced July 2023.
-
Phase transition for detecting a small community in a large network
Authors:
Jiashun Jin,
Zheng Tracy Ke,
Paxton Turner,
Anru R. Zhang
Abstract:
How to detect a small community in a large network is an interesting problem, including clique detection as a special case, where a naive degree-based $\chi^2$-test was shown to be powerful in the presence of an Erdős-Rényi background. Using Sinkhorn's theorem, we show that the signal captured by the $\chi^2$-test may be a modeling artifact, and it may disappear once we replace the Erdős-Rényi model by a broader network model. We show that the recent SgnQ test is more appropriate for such a setting. The test is optimal in detecting communities with sizes comparable to the whole network, but has never been studied for our setting, which is substantially different and more challenging. Using a degree-corrected block model (DCBM), we establish phase transitions of this testing problem concerning the size of the small community and the edge densities in small and large communities. When the size of the small community is larger than $\sqrt{n}$, the SgnQ test is optimal, as it attains the computational lower bound (CLB), the information lower bound for methods allowing polynomial computation time. When the size of the small community is smaller than $\sqrt{n}$, we establish the parameter regime where the SgnQ test has full power and make some conjectures about the CLB. We also study the classical information lower bound (LB) and show that there is always a gap between the CLB and LB in our range of interest.
Submitted 8 March, 2023;
originally announced March 2023.
-
Enhancing convolutional neural network generalizability via low-rank weight approximation
Authors:
Chenyin Gao,
Shu Yang,
Anru R. Zhang
Abstract:
Noise is ubiquitous during image acquisition. Sufficient denoising is often an important first step for image processing. In recent decades, deep neural networks (DNNs) have been widely used for image denoising. Most DNN-based image denoising methods require a large-scale dataset or focus on supervised settings, in which single/pairs of clean images or a set of noisy images are required. This poses a significant burden on the image acquisition process. Moreover, denoisers trained on datasets of limited scale may incur over-fitting. To mitigate these issues, we introduce a new self-supervised framework for image denoising based on the Tucker low-rank tensor approximation. With the proposed design, we are able to characterize our denoiser with fewer parameters and train it based on a single image, which considerably improves the model's generalizability and reduces the cost of data acquisition. Extensive experiments on both synthetic and real-world noisy images have been conducted. Empirical results show that our proposed method outperforms existing non-learning-based methods (e.g., low-pass filter, non-local mean) and single-image unsupervised denoisers (e.g., DIP, NN+BM3D) evaluated on both in-sample and out-sample datasets. The proposed method even achieves performance comparable to some supervised methods (e.g., DnCNN).
Submitted 1 August, 2024; v1 submitted 26 September, 2022;
originally announced September 2022.
-
Core Shrinkage Covariance Estimation for Matrix-variate Data
Authors:
Peter Hoff,
Andrew McCormack,
Anru R. Zhang
Abstract:
A separable covariance model for a random matrix provides a parsimonious description of the covariances among the rows and among the columns of the matrix, and permits likelihood-based inference with a very small sample size. However, in many applications the assumption of exact separability is unlikely to be met, and data analysis with a separable model may overlook or misrepresent important dependence patterns in the data. In this article, we propose a compromise between separable and unstructured covariance estimation. We show how the set of covariance matrices may be uniquely parametrized in terms of the set of separable covariance matrices and a complementary set of "core" covariance matrices, where the core of a separable covariance matrix is the identity matrix. This parametrization defines a Kronecker-core decomposition of a covariance matrix. By shrinking the core of the sample covariance matrix with an empirical Bayes procedure, we obtain an estimator that can adapt to the degree of separability of the population covariance matrix.
Submitted 25 July, 2022;
originally announced July 2022.
-
Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay
Authors:
Yuetian Luo,
Anru R. Zhang
Abstract:
We study the tensor-on-tensor regression, where the goal is to connect tensor responses to tensor covariates with a low Tucker rank parameter tensor/matrix without the prior knowledge of its intrinsic rank. We propose the Riemannian gradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with the challenge of unknown rank by studying the effect of rank over-parameterization. We provide the first convergence guarantee for the general tensor-on-tensor regression by showing that RGD and RGN respectively converge linearly and quadratically to a statistically optimal estimate in both rank correctly-parameterized and over-parameterized settings. Our theory reveals an intriguing phenomenon: Riemannian optimization methods naturally adapt to over-parameterization without modifications to their implementation. We also prove the statistical-computational gap in scalar-on-tensor regression by a direct low-degree polynomial argument. Our theory demonstrates a "blessing of statistical-computational gap" phenomenon: in a wide range of scenarios in tensor-on-tensor regression for tensors of order three or higher, the computationally required sample size matches what is needed by moderate rank over-parameterization when considering computationally feasible estimators, while there are no such benefits in the matrix settings. This shows moderate rank over-parameterization is essentially "cost-free" in terms of sample size in tensor-on-tensor regression of order three or higher. Finally, we conduct simulation studies to show the advantages of our proposed methods and to corroborate our theoretical findings.
Submitted 15 January, 2024; v1 submitted 17 June, 2022;
originally announced June 2022.
-
Learning Polynomial Transformations
Authors:
Sitan Chen,
Jerry Li,
Yuanzhi Li,
Anru R. Zhang
Abstract:
We consider the problem of learning high dimensional polynomial transformations of Gaussians. Given samples of the form $p(x)$, where $x\sim N(0, \mathrm{Id}_r)$ is hidden and $p: \mathbb{R}^r \to \mathbb{R}^d$ is a function where every output coordinate is a low-degree polynomial, the goal is to learn the distribution over $p(x)$. This problem is natural in its own right, but is also an important special case of learning deep generative models, namely pushforwards of Gaussians under two-layer neural networks with polynomial activations. Understanding the learnability of such generative models is crucial to understanding why they perform so well in practice.
Our first main result is a polynomial-time algorithm for learning quadratic transformations of Gaussians in a smoothed setting. Our second main result is a polynomial-time algorithm for learning constant-degree polynomial transformations of Gaussians in a smoothed setting, when the rank of the associated tensors is small. In fact, our results extend to any rotation-invariant input distribution, not just Gaussian. These are the first end-to-end guarantees for learning a pushforward under a neural network with more than one layer.
Along the way, we also give the first polynomial-time algorithms with provable guarantees for tensor ring decomposition, a popular generalization of tensor decomposition that is used in practice to implicitly store large tensors.
Submitted 8 April, 2022;
originally announced April 2022.
-
Guaranteed Functional Tensor Singular Value Decomposition
Authors:
Rungang Han,
Pixu Shi,
Anru R. Zhang
Abstract:
This paper introduces the functional tensor singular value decomposition (FTSVD), a novel dimension reduction framework for tensors with one functional mode and several tabular modes. The problem is motivated by high-order longitudinal data analysis. Our model assumes the observed data to be a random realization of an approximate CP low-rank functional tensor measured on a discrete time grid. Incorporating tensor algebra and the theory of Reproducing Kernel Hilbert Space (RKHS), we propose a novel RKHS-based constrained power iteration with spectral initialization. Our method can successfully estimate both singular vectors and functions of the low-rank structure in the observed data. With mild assumptions, we establish the non-asymptotic contractive error bounds for the proposed algorithm. The superiority of the proposed framework is demonstrated via extensive experiments on both simulated and real data.
Submitted 25 October, 2023; v1 submitted 9 August, 2021;
originally announced August 2021.
-
Nonconvex Factorization and Manifold Formulations are Almost Equivalent in Low-rank Matrix Optimization
Authors:
Yuetian Luo,
Xudong Li,
Anru R. Zhang
Abstract:
In this paper, we consider the geometric landscape connection of the widely studied manifold and factorization formulations in low-rank positive semidefinite (PSD) and general matrix optimization. We establish a sandwich relation on the spectrum of Riemannian and Euclidean Hessians at first-order stationary points (FOSPs). As a result, we obtain an equivalence on the set of FOSPs, second-order stationary points (SOSPs) and strict saddles between the manifold and the factorization formulations. In addition, we show the sandwich relation can be used to transfer more quantitative geometric properties from one formulation to another. Similarities and differences in the landscape connection under the PSD case and the general case are discussed. To the best of our knowledge, this is the first geometric landscape connection between the manifold and the factorization formulations for handling rank constraints, and it provides a geometric explanation for the similar empirical performance of factorization and manifold approaches in low-rank matrix optimization observed in the literature. In the general low-rank matrix optimization, the landscape connection of two factorization formulations (unregularized and regularized ones) is also provided. By applying these geometric landscape connections, in particular the sandwich relation, we are able to solve unanswered questions in the literature and establish stronger results in the applications on geometric analysis of phase retrieval, well-conditioned low-rank matrix optimization, and the role of regularization in factorization arising from machine learning and signal processing.
Submitted 12 August, 2024; v1 submitted 3 August, 2021;
originally announced August 2021.
-
Low-rank Tensor Estimation via Riemannian Gauss-Newton: Statistical Optimality and Second-Order Convergence
Authors:
Yuetian Luo,
Anru R. Zhang
Abstract:
In this paper, we consider the estimation of a low Tucker rank tensor from a number of noisy linear measurements. The general problem covers many specific examples arising from applications, including tensor regression, tensor completion, and tensor PCA/SVD. We consider an efficient Riemannian Gauss-Newton (RGN) method for low Tucker rank tensor estimation. Different from the generic (super)linear convergence guarantee of RGN in the literature, we prove the first local quadratic convergence guarantee of RGN for low-rank tensor estimation in the noisy setting under some regularity conditions and provide the corresponding estimation error upper bounds. A deterministic estimation error lower bound, which matches the upper bound, is provided that demonstrates the statistical optimality of RGN. The merit of RGN is illustrated through two machine learning applications: tensor regression and tensor SVD. Finally, we provide the simulation results to corroborate our theoretical findings.
Submitted 8 July, 2023; v1 submitted 24 April, 2021;
originally announced April 2021.
-
Inference for Low-rank Tensors -- No Need to Debias
Authors:
Dong Xia,
Anru R. Zhang,
Yuchen Zhou
Abstract:
In this paper, we consider the statistical inference for several low-rank tensor models. Specifically, in the Tucker low-rank tensor PCA or regression model, provided with any estimates achieving some attainable error rate, we develop the data-driven confidence regions for the singular subspace of the parameter tensor based on the asymptotic distribution of an updated estimate by two-iteration alternating minimization. The asymptotic distributions are established under some essential conditions on the signal-to-noise ratio (in PCA model) or sample size (in regression model). If the parameter tensor is further orthogonally decomposable, we develop the methods and non-asymptotic theory for inference on each individual singular vector. For the rank-one tensor PCA model, we establish the asymptotic distribution for general linear forms of principal components and confidence interval for each entry of the parameter tensor. Finally, numerical simulations are presented to corroborate our theoretical discoveries.
In all these models, we observe that different from many matrix/vector settings in existing work, debiasing is not required to establish the asymptotic distribution of estimates or to make statistical inference on low-rank tensors. In fact, due to the widely observed statistical-computational gap for low-rank tensor estimation, one usually requires stronger conditions than the statistical (or information-theoretic) limit to ensure the computationally feasible estimation is achievable. Surprisingly, such conditions "incidentally" render a feasible low-rank tensor inference without debiasing.
Submitted 29 October, 2021; v1 submitted 29 December, 2020;
originally announced December 2020.
-
Exact Clustering in Tensor Block Model: Statistical Optimality and Computational Limit
Authors:
Rungang Han,
Yuetian Luo,
Miaoyan Wang,
Anru R. Zhang
Abstract:
High-order clustering aims to identify heterogeneous substructures in multiway datasets that arise commonly in neuroimaging, genomics, social network studies, etc. The non-convex and discontinuous nature of this problem poses significant challenges in both statistics and computation. In this paper, we propose a tensor block model and two computationally efficient methods, the high-order Lloyd algorithm (HLloyd) and high-order spectral clustering (HSC), for high-order clustering. The convergence guarantees and statistical optimality are established for the proposed procedure under a mild sub-Gaussian noise assumption. Under the Gaussian tensor block model, we completely characterize the statistical-computational trade-off for achieving high-order exact clustering based on three different signal-to-noise ratio regimes. The analysis relies on new techniques of high-order spectral perturbation analysis and a "singular-value-gap-free" error bound in tensor estimation, which are substantially different from the matrix spectral analyses in the literature. Finally, we show the merits of the proposed procedures via extensive experiments on both synthetic and real datasets.
Submitted 10 October, 2022; v1 submitted 17 December, 2020;
originally announced December 2020.
-
Recursive Importance Sketching for Rank Constrained Least Squares: Algorithms and High-order Convergence
Authors:
Yuetian Luo,
Wen Huang,
Xudong Li,
Anru R. Zhang
Abstract:
In this paper, we propose the Recursive Importance Sketching algorithm for Rank constrained least squares Optimization (RISRO). The key step of RISRO is recursive importance sketching, a new sketching framework based on deterministically designed recursive projections, which significantly differs from the randomized sketching in the literature (Mahoney, 2011; Woodruff, 2014). Several existing algorithms in the literature can be reinterpreted under this new sketching framework and RISRO offers clear advantages over them. RISRO is easy to implement and computationally efficient, where the core procedure in each iteration is to solve a dimension-reduced least squares problem. We establish the local quadratic-linear and quadratic rate of convergence for RISRO under some mild conditions. We also discover a deep connection of RISRO to the Riemannian Gauss-Newton algorithm on fixed rank matrices. The effectiveness of RISRO is demonstrated in two applications in machine learning and statistics: low-rank matrix trace regression and phase retrieval. Simulation studies demonstrate the superior numerical performance of RISRO.
Submitted 4 December, 2022; v1 submitted 16 November, 2020;
originally announced November 2020.
-
Optimal High-order Tensor SVD via Tensor-Train Orthogonal Iteration
Authors:
Yuchen Zhou,
Anru R. Zhang,
Lili Zheng,
Yazhen Wang
Abstract:
This paper studies a general framework for high-order tensor SVD. We propose a new computationally efficient algorithm, tensor-train orthogonal iteration (TTOI), that aims to estimate the low tensor-train rank structure from the noisy high-order tensor observation. The proposed TTOI consists of initialization via TT-SVD (Oseledets, 2011) and new iterative backward/forward updates. We develop the general upper bound on estimation error for TTOI with the support of several new representation lemmas on tensor matricizations. By developing a matching information-theoretic lower bound, we also prove that TTOI achieves the minimax optimality under the spiked tensor model. The merits of the proposed TTOI are illustrated through applications to estimation and dimension reduction of high-order Markov processes, numerical studies, and a real data example on New York City taxi travel records. The software of the proposed algorithm is available online.
Submitted 24 January, 2022; v1 submitted 6 October, 2020;
originally announced October 2020.
-
Open Problem: Average-Case Hardness of Hypergraphic Planted Clique Detection
Authors:
Yuetian Luo,
Anru R. Zhang
Abstract:
We note the significance of hypergraphic planted clique (HPC) detection in the investigation of computational hardness for a range of tensor problems. We ask if more evidence for the computational hardness of HPC detection can be developed. In particular, we ask whether it is possible to establish the equivalence of the computational hardness between HPC and PC detection.
Submitted 12 September, 2020;
originally announced September 2020.
-
A Sharp Blockwise Tensor Perturbation Bound for Orthogonal Iteration
Authors:
Yuetian Luo,
Garvesh Raskutti,
Ming Yuan,
Anru R. Zhang
Abstract:
In this paper, we develop novel perturbation bounds for the high-order orthogonal iteration (HOOI) [DLDMV00b]. Under mild regularity conditions, we establish blockwise tensor perturbation bounds for HOOI with guarantees for both tensor reconstruction in Hilbert-Schmidt norm $\|\widehat{\boldsymbol{\mathcal{T}}} - \boldsymbol{\mathcal{T}}\|_{\mathrm{HS}}$ and mode-$k$ singular subspace estimation in Schatten-$q$ norm $\|\sin \Theta(\widehat{\mathbf{U}}_k, \mathbf{U}_k)\|_q$ for any $q \geq 1$. We show the upper bounds of mode-$k$ singular subspace estimation are unilateral and converge linearly to a quantity characterized by blockwise errors of the perturbation and signal strength. For the tensor reconstruction error bound, we express the bound through a simple quantity $\xi$, which depends only on perturbation and the multilinear rank of the underlying signal. A rate-matching deterministic lower bound for tensor reconstruction, which demonstrates the optimality of HOOI, is also provided. Furthermore, we prove that one-step HOOI (i.e., HOOI with only a single iteration) is also optimal in terms of tensor reconstruction and can be used to lower the computational cost. The perturbation results are also extended to the case that only partial modes of $\boldsymbol{\mathcal{T}}$ have low-rank structure. We support our theoretical results by extensive numerical studies. Finally, we apply the novel perturbation bounds of HOOI to two applications from machine learning and statistics, tensor denoising and tensor co-clustering, which demonstrate the superiority of the new perturbation results.
Submitted 5 June, 2021; v1 submitted 5 August, 2020;
originally announced August 2020.
-
Tensor Clustering with Planted Structures: Statistical Optimality and Computational Limits
Authors:
Yuetian Luo,
Anru R. Zhang
Abstract:
This paper studies the statistical and computational limits of high-order clustering with planted structures. We focus on two clustering models, constant high-order clustering (CHC) and rank-one higher-order clustering (ROHC), and study the methods and theory for testing whether a cluster exists (detection) and identifying the support of the cluster (recovery).
Specifically, we identify the sharp boundaries of signal-to-noise ratio for which CHC and ROHC detection/recovery are statistically possible. We also develop the tight computational thresholds: when the signal-to-noise ratio is below these thresholds, we prove that polynomial-time algorithms cannot solve these problems under the computational hardness conjectures of hypergraphic planted clique (HPC) detection and hypergraphic planted dense subgraph (HPDS) recovery. We also propose polynomial-time tensor algorithms that achieve reliable detection and recovery when the signal-to-noise ratio is above these thresholds. Both sparsity and tensor structures yield the computational barriers in high-order tensor clustering. The interplay between them results in significant differences between high-order tensor clustering and matrix clustering in the literature in terms of statistical and computational phase transition diagrams, algorithmic approaches, hardness conjectures, and proof techniques. To the best of our knowledge, we are the first to give a thorough characterization of the statistical and computational trade-off for such a double computational-barrier problem. Finally, we provide evidence for the computational hardness conjectures of HPC detection (via low-degree polynomial and Metropolis methods) and HPDS recovery (via low-degree polynomial method).
Submitted 2 October, 2023; v1 submitted 21 May, 2020;
originally announced May 2020.
-
An Optimal Statistical and Computational Framework for Generalized Tensor Estimation
Authors:
Rungang Han,
Rebecca Willett,
Anru R. Zhang
Abstract:
This paper describes a flexible framework for generalized low-rank tensor estimation problems that includes many important instances arising from applications in computational imaging, genomics, and network analysis. The proposed estimator consists of finding a low-rank tensor fit to the data under generalized parametric models. To overcome the difficulty of non-convexity in these problems, we introduce a unified approach of projected gradient descent that adapts to the underlying low-rank structure. Under mild conditions on the loss function, we establish both an upper bound on statistical error and the linear rate of computational convergence through a general deterministic analysis. Then we further consider a suite of generalized tensor estimation problems, including sub-Gaussian tensor PCA, tensor regression, and Poisson and binomial tensor PCA. We prove that the proposed algorithm achieves the minimax optimal rate of convergence in estimation error. Finally, we demonstrate the superiority of the proposed framework via extensive experiments on both simulated and real data.
Submitted 4 February, 2021; v1 submitted 25 February, 2020;
originally announced February 2020.
-
Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference
Authors:
T. Tony Cai,
Anru R. Zhang,
Yuchen Zhou
Abstract:
We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model -- an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.
Submitted 6 May, 2022; v1 submitted 21 September, 2019;
originally announced September 2019.
-
High-dimensional Log-Error-in-Variable Regression with Applications to Microbial Compositional Data Analysis
Authors:
Pixu Shi,
Yuchen Zhou,
Anru R. Zhang
Abstract:
In microbiome and genomic studies, the regression of compositional data has been a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for the variation in sequencing depth, the classic log-contrast model is often used where read counts are normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for the estimation of compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method both corrects sequencing data with possible overdispersion and avoids any subjective imputation of zero read counts. We provide theoretical justifications with matching upper and lower bounds for the estimation error. The merit of the procedure is illustrated through real data analysis and simulation studies.
Submitted 10 March, 2021; v1 submitted 28 November, 2018;
originally announced November 2018.
-
Heteroskedastic PCA: Algorithm, Optimality, and Applications
Authors:
Anru R. Zhang,
T. Tony Cai,
Yihong Wu
Abstract:
A general framework for principal component analysis (PCA) in the presence of heteroskedastic noise is introduced. We propose an algorithm called HeteroPCA, which involves iteratively imputing the diagonal entries of the sample covariance matrix to remove estimation bias due to heteroskedasticity. This procedure is computationally efficient and provably optimal under the generalized spiked covariance model. A key technical step is a deterministic robust perturbation analysis on singular subspaces, which can be of independent interest. The effectiveness of the proposed algorithm is demonstrated in a suite of problems in high-dimensional statistics, including singular value decomposition (SVD) under heteroskedastic noise, Poisson PCA, and SVD for heteroskedastic and incomplete data.
Submitted 1 April, 2021; v1 submitted 18 October, 2018;
originally announced October 2018.
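The diagonal-imputation idea summarized in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' software: it assumes a symmetric input matrix and a known target rank, and the function name `hetero_pca`, the argument names, and the fixed iteration count `n_iter` are all hypothetical choices made for this sketch.

```python
import numpy as np


def hetero_pca(sigma_hat, rank, n_iter=50):
    """Sketch of iterative diagonal imputation (names and defaults are illustrative).

    sigma_hat : symmetric (p, p) array, e.g. a sample covariance matrix
    rank      : assumed number of principal components
    n_iter    : number of imputation sweeps (an assumption; no stopping rule is used)
    """
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    off_diag = ~np.eye(sigma_hat.shape[0], dtype=bool)

    # Start from the input with its diagonal zeroed out, since the diagonal
    # is where heteroskedastic noise introduces bias.
    current = np.where(off_diag, sigma_hat, 0.0)

    for _ in range(n_iter):
        # Best rank-`rank` approximation of the current matrix via truncated SVD.
        u, s, vt = np.linalg.svd(current)
        low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Keep the observed off-diagonal entries; impute the diagonal
        # with the diagonal of the low-rank fit.
        current = np.where(off_diag, sigma_hat, np.diag(np.diag(low_rank)))

    u, _, _ = np.linalg.svd(current)
    return u[:, :rank], current


if __name__ == "__main__":
    p = 20
    direction = np.ones(p) / np.sqrt(p)
    # Rank-one signal plus unequal noise variances on the diagonal.
    sigma = 10.0 * np.outer(direction, direction) + np.diag(np.linspace(0.1, 5.0, p))
    u_hat, imputed = hetero_pca(sigma, rank=1)
    print(abs(u_hat[:, 0] @ direction))
```

Only the diagonal is ever overwritten, so the off-diagonal entries, which are unaffected by independent heteroskedastic noise, drive the estimate of the leading subspace.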
-
Nonparametric covariance estimation for mixed longitudinal studies, with applications in midlife women's health
Authors:
Anru R. Zhang,
Kehui Chen
Abstract:
In mixed longitudinal studies, a group of subjects enter the study at different ages (cross-sectional) and are followed for successive years (longitudinal). In the context of such studies, we consider nonparametric covariance estimation with samples of noisy and partially observed functional trajectories. The proposed algorithm is based on a noniterative sequential-aggregation scheme with only basic matrix operations and closed-form solutions in each step. The good performance of the proposed method is supported by both theory and numerical experiments. We also apply the proposed procedure to a study on the working memory of midlife women, based on data from the Study of Women's Health Across the Nation (SWAN).
Submitted 1 December, 2020; v1 submitted 31 October, 2017;
originally announced November 2017.
-
Methods to Calculate the Upper Bound of Gini Coefficient Based on Grouped Data and the Result for China
Authors:
Pixu Shi,
Anru R. Zhang
Abstract:
Determining an upper bound, particularly the optimal upper bound of the Gini coefficient when dealing with grouped data without specified income brackets, remains an important and open question. In this paper, we introduce an efficient algorithm to calculate the exact optimal upper bound of the Gini coefficient with provable guarantees. To exemplify these methods, we also offer computed results for the Gini coefficients of urban and rural China spanning the years 2003 to 2008.
Submitted 14 January, 2025; v1 submitted 21 May, 2013;
originally announced May 2013.