-
Asymptotic Gaussian Fluctuations of Eigenvectors in Spectral Clustering
Authors:
Hugo Lebeau,
Florent Chatelain,
Romain Couillet
Abstract:
The performance of spectral clustering relies on the fluctuations of the entries of the eigenvectors of a similarity matrix, which have been left uncharacterized until now. In this letter, it is shown that the signal $+$ noise structure of a general spike random matrix model is transferred to the eigenvectors of the corresponding Gram kernel matrix and the fluctuations of their entries are Gaussian in the large-dimensional regime. This CLT-like result was the last missing piece to precisely predict the classification performance of spectral clustering. The proposed proof is very general and relies solely on the rotational invariance of the noise. Numerical experiments on synthetic and real data illustrate the universality of this phenomenon.
Submitted 27 May, 2024; v1 submitted 19 February, 2024;
originally announced February 2024.
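The setting of this abstract can be illustrated with a minimal sketch. It assumes a two-class model $X = \mu y^\top + Z$ with i.i.d. standard Gaussian noise $Z$, which is only one instance of the general spike model mentioned above. The function name, dimensions, and signal strength are hypothetical choices made for illustration, not taken from the paper.

```python
import numpy as np

def gram_spectral_labels(X):
    """Cluster the columns of X by the sign of the dominant eigenvector
    of the Gram kernel matrix K = X^T X / p."""
    p, n = X.shape
    K = X.T @ X / p
    # eigh returns eigenvalues in ascending order: the last column is dominant
    _, eigvecs = np.linalg.eigh(K)
    u = eigvecs[:, -1]
    return np.where(u >= 0, 1, -1), u

# Illustrative two-class spiked model (assumed, not from the paper)
rng = np.random.default_rng(0)
p, n = 200, 400
y = rng.choice([-1, 1], size=n)        # true class labels
mu = rng.standard_normal(p)
mu *= 3.0 / np.linalg.norm(mu)         # class mean with norm 3 (arbitrary)
X = np.outer(mu, y) + rng.standard_normal((p, n))

labels, u = gram_spectral_labels(X)
# The eigenvector is defined up to sign, so accuracy is measured up to a global flip
accuracy = max(np.mean(labels == y), np.mean(labels == -y))
```

In this sketch the entries of `u` cluster around two opposite values according to the class, with a spread around each value; the abstract states that such fluctuations are Gaussian in the large-dimensional regime.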
-
A Random Matrix Approach to Low-Multilinear-Rank Tensor Approximation
Authors:
Hugo Lebeau,
Florent Chatelain,
Romain Couillet
Abstract:
This work presents a comprehensive understanding of the estimation of a planted low-rank signal from a general spiked tensor model near the computational threshold. Relying on standard tools from the theory of large random matrices, we characterize the large-dimensional spectral behavior of the unfoldings of the data tensor and exhibit relevant signal-to-noise ratios governing the detectability of the principal directions of the signal. These results make it possible to accurately predict the reconstruction performance of truncated multilinear SVD (MLSVD) in the non-trivial regime. This is particularly important since it serves as an initialization of the higher-order orthogonal iteration (HOOI) scheme, whose convergence to the best low-multilinear-rank approximation depends entirely on its initialization. We give a sufficient condition for the convergence of HOOI and show that the number of iterations before convergence tends to $1$ in the large-dimensional limit.
Submitted 14 January, 2025; v1 submitted 5 February, 2024;
originally announced February 2024.
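As a rough illustration of the truncated MLSVD mentioned in this abstract, the sketch below computes the leading left singular vectors of each unfolding and projects the tensor onto them. The helper names (`unfold`, `mode_product`, `truncated_mlsvd`, `reconstruct`) and the test dimensions are hypothetical; this is the textbook construction, not necessarily the authors' implementation, and it omits the HOOI refinement.

```python
import numpy as np

def unfold(T, mode):
    """Matricize T along `mode`: rows index that mode, columns all the others."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_product(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    out = np.tensordot(M, T, axes=(1, mode))  # M's row axis comes first
    return np.moveaxis(out, 0, mode)

def truncated_mlsvd(T, ranks):
    """Return (core, factors): factors[k] holds the ranks[k] leading left
    singular vectors of the mode-k unfolding."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Map the core back to the original space."""
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T

# Noiseless low-multilinear-rank tensor (illustrative sizes)
rng = np.random.default_rng(0)
dims, ranks = (10, 12, 8), (2, 3, 2)
signal = rng.standard_normal(ranks)
for mode, d in enumerate(dims):
    Q, _ = np.linalg.qr(rng.standard_normal((d, ranks[mode])))
    signal = mode_product(signal, Q, mode)

core, factors = truncated_mlsvd(signal, ranks)
approx = reconstruct(core, factors)
```

Without noise the reconstruction is exact; with added noise, the quality of `factors` is what the signal-to-noise ratios described in the abstract govern.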
-
Two-way kernel matrix puncturing: towards resource-efficient PCA and spectral clustering
Authors:
Romain Couillet,
Florent Chatelain,
Nicolas Le Bihan
Abstract:
The article introduces an elementary cost and storage reduction method for spectral clustering and principal component analysis. The method consists in randomly "puncturing" both the data matrix $X\in\mathbb{C}^{p\times n}$ (or $\mathbb{R}^{p\times n}$) and its corresponding kernel (Gram) matrix $K$ through Bernoulli masks: $S\in\{0,1\}^{p\times n}$ for $X$ and $B\in\{0,1\}^{n\times n}$ for $K$. The resulting "two-way punctured" kernel is thus given by $K=\frac{1}{p}[(X \odot S)^{\sf H} (X \odot S)] \odot B$. We demonstrate that, for $X$ composed of independent columns drawn from a Gaussian mixture model, as $n,p\to\infty$ with $p/n\to c_0\in(0,\infty)$, the spectral behavior of $K$ -- its limiting eigenvalue distribution, as well as its isolated eigenvalues and eigenvectors -- is fully tractable and exhibits a series of counter-intuitive phenomena. We notably prove, and empirically confirm on GAN-generated image databases, that it is possible to drastically puncture the data, thereby providing possibly huge computational and storage gains, for a virtually constant (clustering or PCA) performance. This preliminary study thus opens the path towards rethinking, from a large-dimensional standpoint, computational and storage costs in elementary machine learning models.
Submitted 17 May, 2021; v1 submitted 24 February, 2021;
originally announced February 2021.
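The kernel formula $K=\frac{1}{p}[(X \odot S)^{\sf H} (X \odot S)] \odot B$ from this abstract can be sketched as follows. The function name, the parameter names `eps_S` and `eps_B` (the Bernoulli keep-probabilities), and the choices of a symmetric mask $B$ with an unmasked diagonal are assumptions made for illustration so that $K$ stays Hermitian; the abstract itself only specifies that $S$ and $B$ are Bernoulli masks.

```python
import numpy as np

def two_way_punctured_kernel(X, eps_S, eps_B, rng):
    """Compute K = (1/p) [(X * S)^H (X * S)] * B with Bernoulli masks.

    eps_S, eps_B: probabilities of keeping an entry of X and of K.
    """
    p, n = X.shape
    # S: i.i.d. Bernoulli(eps_S) mask on the data entries
    S = rng.random((p, n)) < eps_S
    # B: Bernoulli(eps_B) mask on the kernel entries, symmetrized,
    # with the diagonal kept (both are assumptions of this sketch)
    upper = np.triu(rng.random((n, n)) < eps_B, k=1)
    B = upper | upper.T | np.eye(n, dtype=bool)
    XS = X * S
    return (XS.conj().T @ XS) / p * B

# Illustrative two-class Gaussian mixture (sizes and mean are arbitrary)
rng = np.random.default_rng(0)
p, n = 64, 128
y = rng.choice([-1, 1], size=n)
mu = np.full(p, 2.0 / np.sqrt(p))
X = np.outer(mu, y) + rng.standard_normal((p, n))

K = two_way_punctured_kernel(X, eps_S=0.5, eps_B=0.3, rng=rng)
```

Setting both probabilities to 1 recovers the ordinary Gram matrix $\frac{1}{p}X^{\sf H}X$, which makes a convenient sanity check. A practical implementation would store the punctured matrices in a sparse format to realize the storage gains; the dense arrays here are only for readability.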
-
Asymptotic regime for impropriety tests of complex random vectors
Authors:
Florent Chatelain,
Nicolas Le Bihan,
Jonathan H. Manton
Abstract:
Impropriety testing for complex-valued vectors has been considered lately due to potential applications ranging from digital communications to complex media imaging. This paper provides new results for such tests in the asymptotic regime, i.e. when the vector dimension and sample size grow commensurately to infinity. The studied tests are based on invariant statistics named impropriety coefficients. Limiting distributions for these statistics are derived, together with those of the Generalized Likelihood Ratio Test (GLRT) and Roy's test, in the Gaussian case. This characterization in the asymptotic regime also makes it possible to identify a phase transition in Roy's test, with potential application to the detection of a complex-valued low-rank subspace corrupted by proper noise in large datasets. Simulations illustrate the accuracy of the proposed asymptotic approximations.
Submitted 6 January, 2020; v1 submitted 29 April, 2019;
originally announced April 2019.
-
Robust control of varying weak hyperspectral target detection with sparse non-negative representation
Authors:
Raphael Bacher,
Celine Meillier,
Florent Chatelain,
Olivier Michel
Abstract:
In this study, a multiple-comparison approach is developed for detecting faint hyperspectral sources. The detection method relies on a sparse and non-negative representation on a highly coherent dictionary to track a spatially varying source. A robust control of the detection errors is ensured by learning the test statistic distributions on the data. The resulting control is based on the false discovery rate, to take into account the large number of pixels to be tested. This method is applied to data recently recorded by the three-dimensional spectrograph Multi-Unit Spectrograph Explorer.
Submitted 27 March, 2017; v1 submitted 2 February, 2017;
originally announced February 2017.
-
Isotropic Multiple Scattering Processes on Hyperspheres
Authors:
Nicolas Le Bihan,
Florent Chatelain,
Jonathan H. Manton
Abstract:
This paper presents several results about isotropic random walks and multiple scattering processes on hyperspheres ${\mathbb S}^{p-1}$. It allows one to derive the Fourier expansions on ${\mathbb S}^{p-1}$ of these processes. A result of unimodality for the multiconvolution of symmetrical probability density functions (pdf) on ${\mathbb S}^{p-1}$ is also introduced. Such processes are then studied in the case where the scattering distribution is von Mises-Fisher (vMF). Asymptotic distributions for the multiconvolution of vMFs on ${\mathbb S}^{p-1}$ are obtained. Both the Fourier expansion and the asymptotic approximation allow us to compute estimation bounds for the parameters of Compound Cox Processes (CCP) on ${\mathbb S}^{p-1}$.
Submitted 13 December, 2015; v1 submitted 12 August, 2014;
originally announced August 2014.
-
Bayesian Model for Multiple Change-points Detection in Multivariate Time Series
Authors:
Flore Harlé,
Florent Chatelain,
Cédric Gouy-Pailler,
Sophie Achard
Abstract:
This paper addresses the issue of detecting change-points in multivariate time series. The proposed approach differs from existing counterparts by making only weak assumptions on both the change-point structure across series and the statistical signal distributions. Specifically, change-points are not assumed to occur at simultaneous time instants across series, and no specific distribution is assumed on the individual signals. It relies on the combination of a local robust statistical test acting on individual time segments with a global Bayesian framework able to optimize configurations from multiple local statistics (from segments of a unique time series or multiple time series). Using an extensive experimental set-up, our algorithm is shown to perform well on Gaussian data, with the same results in terms of recall and precision as classical approaches, such as the fused lasso and the Bernoulli Gaussian model. Furthermore, it outperforms the reference models in the case of non-normal data with outliers. The control of the False Discovery Rate by an acceptance level is confirmed. In the case of multivariate data, the probabilities that simultaneous change-points are shared by some specific time series are learned. We finally illustrate our algorithm with real datasets from energy monitoring and genomics. Segmentations are compared to state-of-the-art approaches based on fused lasso and group fused lasso.
Submitted 11 July, 2014;
originally announced July 2014.