-
Out-of-Distribution Generalization of In-Context Learning: A Low-Dimensional Subspace Perspective
Authors:
Soo Min Kwon,
Alec S. Xu,
Can Yaras,
Laura Balzano,
Qing Qu
Abstract:
This work aims to demystify the out-of-distribution (OOD) capabilities of in-context learning (ICL) by studying linear regression tasks parameterized with low-rank covariance matrices. With such a parameterization, we can model distribution shifts as a varying angle between the subspace of the training and testing covariance matrices. We prove that a single-layer linear attention model incurs a test risk with a non-negligible dependence on the angle, illustrating that ICL is not robust to such distribution shifts. However, using this framework, we also prove an interesting property of ICL: when trained on task vectors drawn from a union of low-dimensional subspaces, ICL can generalize to any subspace within their span, given sufficiently long prompt lengths. This suggests that the OOD generalization ability of Transformers may actually stem from the new task lying within the span of those encountered during training. We empirically show that our results also hold for models such as GPT-2, and conclude with (i) experiments on how our observations extend to nonlinear function classes and (ii) results on how LoRA has the ability to capture distribution shifts.
Submitted 20 May, 2025;
originally announced May 2025.
-
Truncated Matrix Completion - An Empirical Study
Authors:
Rishhabh Naik,
Nisarg Trivedi,
Davoud Ataee Tarzanagh,
Laura Balzano
Abstract:
Low-rank Matrix Completion (LRMC) describes the problem where we wish to recover missing entries of a partially observed low-rank matrix. Most existing matrix completion work deals with sampling procedures that are independent of the underlying data values. While this assumption allows the derivation of nice theoretical guarantees, it seldom holds in real-world applications. In this paper, we consider various settings where the sampling mask is dependent on the underlying data values, motivated by applications in sensing, sequential decision-making, and recommender systems. Through a series of experiments, we study and compare the performance of various LRMC algorithms that were originally successful for data-independent sampling patterns.
Submitted 14 April, 2025;
originally announced April 2025.
-
An Overview of Low-Rank Structures in the Training and Adaptation of Large Models
Authors:
Laura Balzano,
Tianjiao Ding,
Benjamin D. Haeffele,
Soo Min Kwon,
Qing Qu,
Peng Wang,
Zhangyang Wang,
Can Yaras
Abstract:
The rise of deep learning has revolutionized data processing and prediction in signal processing and machine learning, yet the substantial computational demands of training and deploying modern large-scale deep models present significant challenges, including high computational costs and energy consumption. Recent research has uncovered a widespread phenomenon in deep networks: the emergence of low-rank structures in weight matrices and learned representations during training. These implicit low-dimensional patterns provide valuable insights for improving the efficiency of training and fine-tuning large-scale models. Practical techniques inspired by this phenomenon, such as low-rank adaptation (LoRA) and low-rank training, enable significant reductions in computational cost while preserving model performance. In this paper, we present a comprehensive review of recent advances in exploiting low-rank structures for deep learning and shed light on their mathematical foundations. Mathematically, we present two complementary perspectives on understanding the low-rankness in deep networks: (i) the emergence of low-rank structures throughout the whole optimization dynamics of gradient descent and (ii) the implicit regularization effects that induce such low-rank structures at convergence. From a practical standpoint, studying the low-rank learning dynamics of gradient descent offers a mathematical foundation for understanding the effectiveness of LoRA in fine-tuning large-scale models and inspires parameter-efficient low-rank training strategies. Furthermore, the implicit low-rank regularization effect helps explain the success of various masked training approaches in deep neural networks, ranging from dropout to masked self-supervised learning.
Submitted 25 March, 2025;
originally announced March 2025.
-
Convergence and Complexity Guarantee for Inexact First-order Riemannian Optimization Algorithms
Authors:
Yuchen Li,
Laura Balzano,
Deanna Needell,
Hanbaek Lyu
Abstract:
We analyze inexact Riemannian gradient descent (RGD) where Riemannian gradients and retractions are inexactly (and cheaply) computed. Our focus is on understanding when inexact RGD converges and what its complexity is in the general nonconvex and constrained setting. We answer these questions in a general framework of tangential Block Majorization-Minimization (tBMM). We establish that tBMM converges to an $ε$-stationary point within $O(ε^{-2})$ iterations. Under a mild assumption, the results still hold when the subproblem is solved inexactly in each iteration, provided the total optimality gap is bounded. Our general analysis applies to a wide range of classical algorithms with Riemannian constraints, including inexact RGD and the proximal gradient method on Stiefel manifolds. We numerically validate that tBMM shows improved performance over existing methods when applied to various problems, including nonnegative tensor decomposition with Riemannian constraints, regularized nonnegative matrix factorization, and low-rank matrix recovery problems.
Submitted 9 May, 2024; v1 submitted 5 May, 2024;
originally announced May 2024.
-
Convergence and complexity of block majorization-minimization for constrained block-Riemannian optimization
Authors:
Yuchen Li,
Laura Balzano,
Deanna Needell,
Hanbaek Lyu
Abstract:
Block majorization-minimization (BMM) is a simple iterative algorithm for nonconvex optimization that sequentially minimizes a majorizing surrogate of the objective function in each block coordinate while the other block coordinates are held fixed. We consider a family of BMM algorithms for minimizing smooth nonconvex objectives, where each parameter block is constrained within a subset of a Riemannian manifold. We establish that this algorithm converges asymptotically to the set of stationary points, and attains an $ε$-stationary point within $\widetilde{O}(ε^{-2})$ iterations. In particular, the assumptions for our complexity results are completely Euclidean when the underlying manifold is a product of Euclidean or Stiefel manifolds, although our analysis makes explicit use of the Riemannian geometry. Our general analysis applies to a wide range of algorithms with Riemannian constraints: Riemannian MM, block projected gradient descent, optimistic likelihood estimation, geodesically constrained subspace tracking, robust PCA, and Riemannian CP-dictionary-learning. We experimentally validate that our algorithm converges faster than standard Euclidean algorithms applied to the Riemannian setting.
Submitted 6 August, 2024; v1 submitted 16 December, 2023;
originally announced December 2023.
-
Understanding Deep Representation Learning via Layerwise Feature Compression and Discrimination
Authors:
Peng Wang,
Xiao Li,
Can Yaras,
Zhihui Zhu,
Laura Balzano,
Wei Hu,
Qing Qu
Abstract:
Over the past decade, deep learning has proven to be a highly effective tool for learning meaningful features from raw data. However, it remains an open question how deep networks perform hierarchical feature learning across layers. In this work, we attempt to unveil this mystery by investigating the structures of intermediate features. Motivated by our empirical findings that linear layers mimic the roles of deep layers in nonlinear networks for feature learning, we explore how deep linear networks transform input data into output by investigating the output (i.e., features) of each layer after training in the context of multi-class classification problems. Toward this goal, we first define metrics to measure within-class compression and between-class discrimination of intermediate features, respectively. Through theoretical analysis of these two metrics, we show that the evolution of features follows a simple and quantitative pattern from shallow to deep layers when the input data is nearly orthogonal and the network weights are minimum-norm, balanced, and approximately low-rank: each layer of the linear network progressively compresses within-class features at a geometric rate and discriminates between-class features at a linear rate with respect to the number of layers that data have passed through. To the best of our knowledge, this is the first quantitative characterization of feature evolution in hierarchical representations of deep linear networks. Empirically, our extensive experiments not only validate our theoretical results numerically but also reveal a similar pattern in deep nonlinear networks, which aligns well with recent empirical studies. Moreover, we demonstrate the practical implications of our results in transfer learning. Our code is available at https://github.com/Heimine/PNC_DLN.
Submitted 21 May, 2025; v1 submitted 6 November, 2023;
originally announced November 2023.
-
Learning physics-based reduced-order models from data using nonlinear manifolds
Authors:
Rudy Geelen,
Laura Balzano,
Stephen Wright,
Karen Willcox
Abstract:
We present a novel method for learning reduced-order models of dynamical systems using nonlinear manifolds. First, we learn the manifold by identifying nonlinear structure in the data through a general representation learning problem. The proposed approach is driven by embeddings of low-order polynomial form. A projection onto the nonlinear manifold reveals the algebraic structure of the reduced-space system that governs the problem of interest. The matrix operators of the reduced-order model are then inferred from the data using operator inference. Numerical experiments on a number of nonlinear problems demonstrate the generalizability of the methodology and the increase in accuracy that can be obtained over reduced-order modeling methods that employ a linear subspace approximation.
Submitted 19 February, 2024; v1 submitted 5 August, 2023;
originally announced August 2023.
-
Learning latent representations in high-dimensional state spaces using polynomial manifold constructions
Authors:
Rudy Geelen,
Laura Balzano,
Karen Willcox
Abstract:
We present a novel framework for learning cost-efficient latent representations in problems with high-dimensional state spaces through nonlinear dimension reduction. By enriching linear state approximations with low-order polynomial terms, we account for key nonlinear interactions existing in the data, thereby reducing the problem's intrinsic dimensionality. Two methods are introduced for learning the representation of such low-dimensional polynomial manifolds for embedding the data. The manifold parametrization coefficients can be obtained by regression via either a proper orthogonal decomposition or an alternating-minimization-based approach. Our numerical results focus on the one-dimensional Korteweg-de Vries equation, where accounting for nonlinear correlations in the data was found to lower the representation error by up to two orders of magnitude compared to linear dimension reduction techniques.
Submitted 23 June, 2023;
originally announced June 2023.
-
A Proximal DC Algorithm for Sample Average Approximation of Chance Constrained Programming
Authors:
Peng Wang,
Rujun Jiang,
Qingyuan Kong,
Laura Balzano
Abstract:
Chance constrained programming (CCP) refers to a type of optimization problem with uncertain constraints that are satisfied with at least a prescribed probability level. In this work, we study the sample average approximation (SAA) of chance constraints. This is an important approach to solving CCP, especially in the data-driven setting where only a sample of multiple realizations of the random vector in the chance constraints is available. The SAA is obtained by replacing the underlying distribution with an empirical distribution over the available sample. Assuming that the functions in chance constraints are all convex, we reformulate the SAA of chance constraints into a difference-of-convex (DC) form. Moreover, considering that the objective function is a difference-of-convex function, the resulting formulation becomes a DC constrained DC program. Then, we propose a proximal DC algorithm for solving this reformulation. In particular, we show that the subproblems of the proximal DC are suitable for off-the-shelf solvers in some scenarios. Moreover, we not only prove the subsequential and sequential convergence of the proposed algorithm but also derive the iteration complexity for finding an approximate Karush-Kuhn-Tucker point. To support and complement our theoretical development, we show via numerical experiments that our proposed approach is competitive with a host of existing approaches.
Submitted 28 April, 2025; v1 submitted 1 January, 2023;
originally announced January 2023.
-
Online Bilevel Optimization: Regret Analysis of Online Alternating Gradient Methods
Authors:
Davoud Ataee Tarzanagh,
Parvin Nazari,
Bojian Hou,
Li Shen,
Laura Balzano
Abstract:
This paper introduces online bilevel optimization, in which a sequence of time-varying bilevel problems is revealed one after the other. We extend the known regret bounds for online single-level algorithms to the bilevel setting. Specifically, we provide new notions of bilevel regret, develop an online alternating time-averaged gradient method that is capable of leveraging smoothness, and give regret bounds in terms of the path-length of the inner and outer minimizer sequences.
Submitted 8 July, 2024; v1 submitted 6 July, 2022;
originally announced July 2022.
-
Convergence and Recovery Guarantees of the K-Subspaces Method for Subspace Clustering
Authors:
Peng Wang,
Huikang Liu,
Anthony Man-Cho So,
Laura Balzano
Abstract:
The K-subspaces (KSS) method is a generalization of the K-means method for subspace clustering. In this work, we present local convergence analysis and a recovery guarantee for KSS, assuming data are generated by the semi-random union of subspaces model, where $N$ points are randomly sampled from $K \ge 2$ overlapping subspaces. We show that if the initial assignment of the KSS method lies within a neighborhood of a true clustering, it converges at a superlinear rate and finds the correct clustering within $Θ(\log\log N)$ iterations with high probability. Moreover, we propose a thresholding inner-product based spectral method for initialization and prove that it produces a point in this neighborhood. We also present numerical results of the studied method to support our theoretical developments.
Submitted 18 June, 2022; v1 submitted 11 June, 2022;
originally announced June 2022.
-
A Semidefinite Relaxation for Sums of Heterogeneous Quadratic Forms on the Stiefel Manifold
Authors:
Kyle Gilman,
Sam Burer,
Laura Balzano
Abstract:
We study the maximization of sums of heterogeneous quadratic forms over the Stiefel manifold, a nonconvex problem that arises in several modern signal processing and machine learning applications such as heteroscedastic probabilistic principal component analysis (HPPCA). In this work, we derive a novel semidefinite program (SDP) relaxation of the original problem and study a few of its theoretical properties. We prove a global optimality certificate for the original nonconvex problem via a dual certificate, which leads to a simple feasibility problem to certify global optimality of a candidate solution on the Stiefel manifold. In addition, our relaxation reduces to an assignment linear program for jointly diagonalizable problems and is therefore known to be tight in that case. We generalize this result to show that it is also tight for close-to jointly diagonalizable problems, and we show that the HPPCA problem has this characteristic. Numerical results validate our global optimality certificate and sufficient conditions for when the SDP is tight in various problem settings.
Submitted 7 April, 2025; v1 submitted 26 May, 2022;
originally announced May 2022.
-
Identification and Adaptive Control of Markov Jump Systems: Sample Complexity and Regret Bounds
Authors:
Yahya Sattar,
Zhe Du,
Davoud Ataee Tarzanagh,
Laura Balzano,
Necmiye Ozay,
Samet Oymak
Abstract:
Learning how to effectively control unknown dynamical systems is crucial for intelligent autonomous systems. This task becomes a significant challenge when the underlying dynamics are changing with time. Motivated by this challenge, this paper considers the problem of controlling an unknown Markov jump linear system (MJS) to optimize a quadratic objective. By taking a model-based perspective, we consider identification-based adaptive control for MJSs. We first provide a system identification algorithm for MJS to learn the dynamics in each mode, as well as the Markov transition matrix underlying the evolution of the mode switches, from a single trajectory of the system states, inputs, and modes. Through mixing-time arguments, the sample complexity of this algorithm is shown to be $\mathcal{O}(1/\sqrt{T})$. We then propose an adaptive control scheme that performs system identification together with certainty equivalent control to adapt the controllers in an episodic fashion. Combining our sample complexity results with recent perturbation results for certainty equivalent control, we prove that when the episode lengths are appropriately chosen, the proposed adaptive control scheme achieves $\mathcal{O}(\sqrt{T})$ regret, which can be improved to $\mathcal{O}(\mathrm{polylog}(T))$ with partial knowledge of the system. Our proof strategy introduces innovations to handle Markovian jumps and a weaker notion of stability common in MJSs. Our analysis provides insights into system theoretic quantities that affect learning accuracy and control performance. Numerical simulations are presented to further reinforce these insights.
Submitted 12 November, 2021;
originally announced November 2021.
-
Certainty Equivalent Quadratic Control for Markov Jump Systems
Authors:
Zhe Du,
Yahya Sattar,
Davoud Ataee Tarzanagh,
Laura Balzano,
Samet Oymak,
Necmiye Ozay
Abstract:
Real-world control applications often involve complex dynamics subject to abrupt changes or variations. Markov jump linear systems (MJS) provide a rich framework for modeling such dynamics. Despite an extensive history, theoretical understanding of parameter sensitivities of MJS control is somewhat lacking. Motivated by this, we investigate robustness aspects of certainty equivalent model-based optimal control for MJS with a quadratic cost function. Given that the uncertainty in the system matrices and in the Markov transition matrix is bounded by $ε$ and $η$, respectively, robustness results are established for (i) the solution to coupled Riccati equations and (ii) the optimal cost, by providing explicit perturbation bounds which decay as $\mathcal{O}(ε + η)$ and $\mathcal{O}((ε + η)^2)$, respectively.
Submitted 26 May, 2021;
originally announced May 2021.
-
HePPCAT: Probabilistic PCA for Data with Heteroscedastic Noise
Authors:
David Hong,
Kyle Gilman,
Laura Balzano,
Jeffrey A. Fessler
Abstract:
Principal component analysis (PCA) is a classical and ubiquitous method for reducing data dimensionality, but it is suboptimal for heterogeneous data that are increasingly common in modern applications. PCA treats all samples uniformly and so degrades when the noise is heteroscedastic across samples, as occurs, e.g., when samples come from sources of heterogeneous quality. This paper develops a probabilistic PCA variant that estimates and accounts for this heterogeneity by incorporating it in the statistical model. Unlike in the homoscedastic setting, the resulting nonconvex optimization problem is seemingly not solved by singular value decomposition. This paper develops a heteroscedastic probabilistic PCA technique (HePPCAT) that uses efficient alternating maximization algorithms to jointly estimate both the underlying factors and the unknown noise variances. Simulation experiments illustrate the comparative speed of the algorithms, the benefit of accounting for heteroscedasticity, and the seemingly favorable optimization landscape of this problem. Real data experiments on environmental air quality data show that HePPCAT can give a better PCA estimate than techniques that do not account for heteroscedasticity.
Submitted 1 December, 2021; v1 submitted 9 January, 2021;
originally announced January 2021.
-
Online matrix factorization for Markovian data and applications to Network Dictionary Learning
Authors:
Hanbaek Lyu,
Deanna Needell,
Laura Balzano
Abstract:
Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this paper, we show that a non-convex generalization of the well-known OMF algorithm for an i.i.d. stream of data in \citep{mairal2010online} converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. This allows one to extract features more efficiently from dependent data streams, as there is no need to subsample the data sequence to approximately satisfy the independence assumption. As the main application, by combining online non-negative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning, which extracts "network dictionary patches" from a given network in an online manner that encodes the main features of the network. We demonstrate this technique and its application to network denoising problems on real-world network data.
Submitted 7 November, 2020; v1 submitted 5 November, 2019;
originally announced November 2019.
-
A Memory-efficient Algorithm for Large-scale Sparsity Regularized Image Reconstruction
Authors:
Greg Ongie,
Naveen Murthy,
Laura Balzano,
Jeffrey A. Fessler
Abstract:
We derive a memory-efficient first-order variable splitting algorithm for convex image reconstruction problems with non-smooth regularization terms. The algorithm is based on a primal-dual approach, where one of the dual variables is updated using a step of the Frank-Wolfe algorithm, rather than the typical proximal point step used in other primal-dual algorithms. We show that in certain cases this results in an algorithm with far less memory demand than other first-order methods based on proximal mappings. We demonstrate the algorithm on the problem of sparse-view X-ray computed tomography (CT) reconstruction with non-smooth edge-preserving regularization and show competitive run-time with other state-of-the-art algorithms while using much less memory.
Submitted 31 March, 2019;
originally announced April 2019.
-
Optimally Weighted PCA for High-Dimensional Heteroscedastic Data
Authors:
David Hong,
Fan Yang,
Jeffrey A. Fessler,
Laura Balzano
Abstract:
Modern data are increasingly both high-dimensional and heteroscedastic. This paper considers the challenge of estimating underlying principal components from high-dimensional data with noise that is heteroscedastic across samples, i.e., some samples are noisier than others. Such heteroscedasticity naturally arises, e.g., when combining data from diverse sources or sensors. A natural way to account for this heteroscedasticity is to give noisier blocks of samples less weight in PCA by using the leading eigenvectors of a weighted sample covariance matrix. We consider the problem of choosing weights to optimally recover the underlying components. In general, one cannot know these optimal weights since they depend on the underlying components we seek to estimate. However, we show that under some natural statistical assumptions the optimal weights converge to a simple function of the signal and noise variances for high-dimensional data. Surprisingly, the optimal weights are not the inverse noise variance weights commonly used in practice. We demonstrate the theoretical results through numerical simulations and comparisons with existing weighting schemes. Finally, we briefly discuss how estimated signal and noise variances can be used when the true variances are unknown, and we illustrate the optimal weights on real data from astronomy.
Submitted 13 September, 2022; v1 submitted 30 October, 2018;
originally announced October 2018.
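
As a minimal sketch of the weighted PCA idea described in this abstract, the snippet below forms a weighted sample covariance from blocks of samples and takes its leading eigenvectors. The function name `weighted_pca`, the block layout, and the exact normalization are assumptions made for illustration; the weights are supplied by the caller, and the paper's optimal weights are not reproduced here.

```python
import numpy as np

def weighted_pca(blocks, weights, k):
    """Leading k eigenvectors of a weighted sample covariance.

    blocks  : list of (d x n_l) arrays, one per noise level (hypothetical layout)
    weights : one nonnegative weight per block, chosen by the caller
    k       : number of components to return
    """
    d = blocks[0].shape[0]
    n = sum(Y.shape[1] for Y in blocks)
    C = np.zeros((d, d))
    for Y, w in zip(blocks, weights):
        # Each block contributes its outer-product sum, scaled by its weight.
        C += w * (Y @ Y.T)
    C /= n
    # eigh returns eigenvalues in ascending order; keep the k largest.
    evals, evecs = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:k]
    return evecs[:, order], evals[order]

# Usage: two blocks with different noise levels and inverse noise variance
# weights, the common choice that the abstract notes is not optimal.
rng = np.random.default_rng(0)
u = np.linalg.qr(rng.standard_normal((10, 1)))[0]
sigmas = [0.1, 1.0]
blocks = [u @ rng.standard_normal((1, 50)) + s * rng.standard_normal((10, 50))
          for s in sigmas]
components, _ = weighted_pca(blocks, [1.0 / s**2 for s in sigmas], k=1)
```

Setting all weights to 1 reduces this to ordinary PCA on the pooled samples, which makes it a convenient baseline when comparing weighting schemes.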
-
Asymptotic performance of PCA for high-dimensional heteroscedastic data
Authors:
David Hong,
Laura Balzano,
Jeffrey A. Fessler
Abstract:
Principal Component Analysis (PCA) is a classical method for reducing the dimensionality of data by projecting them onto a subspace that captures most of their variation. Effective use of PCA in modern applications requires understanding its performance for data that are both high-dimensional and heteroscedastic. This paper analyzes the statistical performance of PCA in this setting, i.e., for high-dimensional data drawn from a low-dimensional subspace and degraded by heteroscedastic noise. We provide simplified expressions for the asymptotic PCA recovery of the underlying subspace, subspace amplitudes and subspace coefficients; the expressions enable both easy and efficient calculation and reasoning about the performance of PCA. We exploit the structure of these expressions to show that, for a fixed average noise variance, the asymptotic recovery of PCA for heteroscedastic data is always worse than that for homoscedastic data (i.e., for noise variances that are equal across samples). Hence, while average noise variance is often a practically convenient measure for the overall quality of data, it gives an overly optimistic estimate of the performance of PCA for heteroscedastic data.
Submitted 23 June, 2018; v1 submitted 20 March, 2017;
originally announced March 2017.
-
Real-Time Energy Disaggregation of a Distribution Feeder's Demand Using Online Learning
Authors:
Gregory S. Ledva,
Laura Balzano,
Johanna L. Mathieu
Abstract:
Though distribution system operators have been adding more sensors to their networks, they still often lack an accurate real-time picture of the behavior of distributed energy resources such as demand responsive electric loads and residential solar generation. Such information could improve system reliability, economic efficiency, and environmental impact. Rather than installing additional, costly sensing and communication infrastructure to obtain additional real-time information, it may be possible to use existing sensing capabilities and leverage knowledge about the system to reduce the need for new infrastructure. In this paper, we disaggregate a distribution feeder's demand measurements into: 1) the demand of a population of air conditioners, and 2) the demand of the remaining loads connected to the feeder. We use an online learning algorithm, Dynamic Fixed Share (DFS), that uses the real-time distribution feeder measurements as well as models generated from historical building- and device-level data. We develop two implementations of the algorithm and conduct case studies using real demand data from households and commercial buildings to investigate the effectiveness of the algorithm. The case studies demonstrate that DFS can effectively perform online disaggregation and that the choice and construction of models included in the algorithm affect its accuracy, which is comparable to that of a set of Kalman filters.
Submitted 4 May, 2018; v1 submitted 16 January, 2017;
originally announced January 2017.
-
Towards a Theoretical Analysis of PCA for Heteroscedastic Data
Authors:
David Hong,
Laura Balzano,
Jeffrey A. Fessler
Abstract:
Principal Component Analysis (PCA) is a method for estimating a subspace given noisy samples. It is useful in a variety of problems ranging from dimensionality reduction to anomaly detection and the visualization of high-dimensional data. PCA performs well in the presence of moderate noise and even with missing data, but is also sensitive to outliers. PCA is also known to have a phase transition when noise is independent and identically distributed; recovery of the subspace sharply declines at a threshold noise variance. Effective use of PCA requires a rigorous understanding of these behaviors. This paper provides a step towards an analysis of PCA for samples with heteroscedastic noise, that is, samples that have non-uniform noise variances and so are no longer identically distributed. In particular, we provide a simple asymptotic prediction of the recovery of a one-dimensional subspace from noisy heteroscedastic samples. The prediction enables: a) easy and efficient calculation of the asymptotic performance, and b) qualitative reasoning to understand how PCA is impacted by heteroscedasticity (such as outliers).
Submitted 12 October, 2016;
originally announced October 2016.
-
Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation From Undersampled Data
Authors:
Dejiao Zhang,
Laura Balzano
Abstract:
Subspace learning and matrix factorization problems have a great many applications in science and engineering, and efficient algorithms are critical as dataset sizes continue to grow. Many relevant problem formulations are non-convex, and in a variety of contexts it has been observed that solving the non-convex problem directly is not only efficient but reliably accurate. We discuss convergence theory for a particular method: first-order incremental gradient descent constrained to the Grassmannian. The output of the algorithm is an orthonormal basis for a $d$-dimensional subspace spanned by an input streaming data matrix. We study two sampling cases: where each data vector of the streaming matrix is fully sampled, or where it is undersampled by a sampling matrix $A_t\in \mathbb{R}^{m\times n}$ with $m\ll n$. Our results cover two cases, where $A_t$ is Gaussian or a subset of rows of the identity matrix. We propose an adaptive stepsize scheme that depends only on the sampled data and algorithm outputs. We prove that with fully sampled data, the stepsize scheme maximizes the improvement of our convergence metric at each iteration, and this method converges from any random initialization to the true subspace, despite the non-convex formulation and orthogonality constraints. For the case of undersampled data, we establish monotonic expected improvement on the defined convergence metric for each iteration with high probability.
Submitted 20 February, 2022; v1 submitted 1 October, 2016;
originally announced October 2016.
-
Global Convergence of a Grassmannian Gradient Descent Algorithm for Subspace Estimation
Authors:
Dejiao Zhang,
Laura Balzano
Abstract:
It has been observed in a variety of contexts that gradient descent methods have great success in solving low-rank matrix factorization problems, despite the relevant problem formulation being non-convex. We tackle a particular instance of this scenario, where we seek the $d$-dimensional subspace spanned by a streaming data matrix. We apply the natural first-order incremental gradient descent method, constraining the gradient method to the Grassmannian. In this paper, we propose an adaptive step size scheme that is greedy in the noiseless case, in that it maximizes the improvement of our metric of convergence at each data index $t$, and that yields an expected improvement in the noisy case. We show that, with noise-free data, this method converges from any random initialization to the global minimum of the problem. For noisy data, we provide the expected convergence rate of the proposed algorithm per iteration.
Submitted 24 June, 2016; v1 submitted 24 June, 2015;
originally announced June 2015.
-
On GROUSE and Incremental SVD
Authors:
Laura Balzano,
Stephen J. Wright
Abstract:
GROUSE (Grassmannian Rank-One Update Subspace Estimation) is an incremental algorithm for identifying a subspace of $\mathbb{R}^n$ from a sequence of vectors in this subspace, where only a subset of components of each vector is revealed at each iteration. Recent analysis has shown that GROUSE converges locally at an expected linear rate, under certain assumptions. GROUSE has a similar flavor to the incremental singular value decomposition algorithm, which updates the SVD of a matrix following addition of a single column. In this paper, we modify the incremental SVD approach to handle missing data, and demonstrate that this modified approach is equivalent to GROUSE, for a certain choice of an algorithmic parameter.
Submitted 20 July, 2013;
originally announced July 2013.
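
The incremental SVD step mentioned in this abstract, updating the SVD of a matrix after one column is appended, can be sketched as follows for the fully observed case. This is an illustrative textbook-style update, not the paper's missing-data modification; the function name `isvd_add_column` is hypothetical, and the sketch assumes the new column has a nonzero component outside the current column space.

```python
import numpy as np

def isvd_add_column(U, s, Vt, v):
    """Update a thin SVD A = U @ diag(s) @ Vt to the SVD of [A, v].

    Assumes v is fully observed and not already in the span of U.
    """
    k = s.size
    w = U.T @ v              # coefficients of v in the current basis
    r = v - U @ w            # residual orthogonal to span(U)
    rho = np.linalg.norm(r)
    q = r / rho              # new orthonormal direction (requires rho > 0)

    # Small (k+1) x (k+1) core matrix whose SVD gives the update.
    K = np.zeros((k + 1, k + 1))
    K[:k, :k] = np.diag(s)
    K[:k, k] = w
    K[k, k] = rho
    Uk, s_new, Vkt = np.linalg.svd(K)

    U_new = np.column_stack([U, q]) @ Uk

    # Block-diagonal extension of V with a 1 for the appended column.
    m = Vt.shape[1]
    W = np.zeros((m + 1, k + 1))
    W[:m, :k] = Vt.T
    W[m, k] = 1.0
    Vt_new = Vkt @ W.T
    return U_new, s_new, Vt_new
```

Only the small core matrix is decomposed at each step, which is why the update is cheap when the rank is much smaller than the ambient dimension. In practice the result is usually truncated back to a fixed rank after each update; that truncation is omitted here.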
-
Local Convergence of an Algorithm for Subspace Identification from Partial Data
Authors:
Laura Balzano,
Stephen J. Wright
Abstract:
GROUSE (Grassmannian Rank-One Update Subspace Estimation) is an iterative algorithm for identifying a linear subspace of R^n from data consisting of partial observations of random vectors from that subspace. This paper examines local convergence properties of GROUSE, under assumptions on the randomness of the observed vectors, the randomness of the subset of elements observed at each iteration, and incoherence of the subspace with the coordinate directions. Convergence at an expected linear rate is demonstrated under certain assumptions. The case in which the full random vector is revealed at each iteration allows for much simpler analysis, and is also described. GROUSE is related to incremental SVD methods and to gradient projection algorithms in optimization.
Submitted 1 July, 2014; v1 submitted 14 June, 2013;
originally announced June 2013.
-
Iterative Grassmannian Optimization for Robust Image Alignment
Authors:
Jun He,
Dejiao Zhang,
Laura Balzano,
Tao Tao
Abstract:
Robust high-dimensional data processing has witnessed an exciting development in recent years, as theoretical results have shown that it is possible using convex programming to optimize data fit to a low-rank component plus a sparse outlier component. This problem is also known as Robust PCA, and it has found application in many areas of computer vision. In image and video processing and face recognition, the opportunity to process massive image databases is emerging as people upload photo and video data online in unprecedented volumes. However, data quality and consistency are not controlled in any way, and the massiveness of the data poses a serious computational challenge. In this paper we present t-GRASTA, or "Transformed GRASTA (Grassmannian Robust Adaptive Subspace Tracking Algorithm)". t-GRASTA iteratively performs incremental gradient descent constrained to the Grassmann manifold of subspaces in order to simultaneously estimate a decomposition of a collection of images into a low-rank subspace, a sparse part of occlusions and foreground objects, and a transformation such as rotation or translation of the image. We show that t-GRASTA is 4$\times$ faster than state-of-the-art algorithms, has half the memory requirement, and can achieve alignment for face images as well as jittered camera surveillance images.
Submitted 20 June, 2013; v1 submitted 3 June, 2013;
originally announced June 2013.
-
Online Robust Subspace Tracking from Partial Information
Authors:
Jun He,
Laura Balzano,
John C. S. Lui
Abstract:
This paper presents GRASTA (Grassmannian Robust Adaptive Subspace Tracking Algorithm), an efficient and robust online algorithm for tracking subspaces from highly incomplete information. The algorithm uses a robust $l^1$-norm cost function in order to estimate and track non-stationary subspaces when the streaming data vectors are corrupted with outliers. We apply GRASTA to the problems of robust matrix completion and real-time separation of background from foreground in video. In this second application, we show that GRASTA performs high-quality separation of moving objects from background at exceptional speeds: In one popular benchmark video example, GRASTA achieves a rate of 57 frames per second, even when run in MATLAB on a personal laptop.
Submitted 20 September, 2011; v1 submitted 17 September, 2011;
originally announced September 2011.
-
Online Identification and Tracking of Subspaces from Highly Incomplete Information
Authors:
Laura Balzano,
Robert Nowak,
Benjamin Recht
Abstract:
This work presents GROUSE (Grassmannian Rank-One Update Subspace Estimation), an efficient online algorithm for tracking subspaces from highly incomplete observations. GROUSE requires only basic linear algebraic manipulations at each iteration, and each subspace update can be performed in linear time in the dimension of the subspace. The algorithm is derived by analyzing incremental gradient descent on the Grassmannian manifold of subspaces. With a slight modification, GROUSE can also be used as an online incremental algorithm for the matrix completion problem of imputing missing entries of a low-rank matrix. GROUSE performs exceptionally well in practice both in tracking subspaces and as an online algorithm for matrix completion.
Submitted 12 July, 2011; v1 submitted 21 June, 2010;
originally announced June 2010.
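
To make the rank-one update described in this abstract concrete, here is a minimal sketch of one GROUSE-style step from a partially observed vector. It follows the commonly stated form of the update (least-squares weights on the observed rows, a zero-filled residual, then a rotation along a geodesic of the Grassmannian); the function name `grouse_step`, the constant step size `step`, and the tolerance are assumptions for illustration rather than the authors' code.

```python
import numpy as np

def grouse_step(U, idx, v_obs, step=0.1):
    """One rank-one update of an orthonormal basis U (n x d).

    idx   : indices of the observed entries of the incoming vector
    v_obs : observed values, in the same order as idx
    step  : step size (hypothetical constant; adaptive choices are possible)
    """
    U_obs = U[idx, :]
    # Least-squares weights that best explain the observed entries.
    w, *_ = np.linalg.lstsq(U_obs, v_obs, rcond=None)
    p = U @ w                          # prediction over all n coordinates
    r = np.zeros(U.shape[0])
    r[idx] = v_obs - U_obs @ w         # residual, zero on unobserved entries

    norm_p, norm_r, norm_w = (np.linalg.norm(x) for x in (p, r, w))
    if norm_r < 1e-12 or norm_w < 1e-12:
        return U                       # nothing to correct

    # Rotate the direction p/|p| toward r/|r| by angle theta.
    theta = step * norm_r * norm_p
    direction = ((np.cos(theta) - 1.0) * p / norm_p
                 + np.sin(theta) * r / norm_r)
    return U + np.outer(direction, w / norm_w)

# Usage: track a 3-dimensional subspace of R^20 from 10 observed entries
# per vector, starting from a random orthonormal initialization.
rng = np.random.default_rng(0)
n, d = 20, 3
U_true = np.linalg.qr(rng.standard_normal((n, d)))[0]
U = np.linalg.qr(rng.standard_normal((n, d)))[0]
for _ in range(200):
    x = U_true @ rng.standard_normal(d)
    idx = rng.choice(n, size=10, replace=False)
    U = grouse_step(U, idx, x[idx])
```

Because the residual is orthogonal to the columns of `U`, the update is a rotation and keeps `U` orthonormal without any re-orthogonalization.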