-
Optimal Eigenvalue Shrinkage in the Semicircle Limit
Authors:
David L. Donoho,
Michael J. Feldman
Abstract:
Modern datasets are trending towards ever higher dimension. In response, recent theoretical studies of covariance estimation often assume the proportional-growth asymptotic framework, where the sample size $n$ and dimension $p$ are comparable, with $n, p \rightarrow \infty $ and $γ_n = p/n \rightarrow γ> 0$. Yet, many datasets -- perhaps most -- have very different numbers of rows and columns. We consider instead the disproportional-growth asymptotic framework, where $n, p \rightarrow \infty$ and $γ_n \rightarrow 0$ or $γ_n \rightarrow \infty$. Either disproportional limit induces novel behavior unseen within previous proportional and fixed-$p$ analyses. We study the spiked covariance model, with theoretical covariance a low-rank perturbation of the identity. For each of 15 different loss functions, we exhibit in closed form new optimal shrinkage and thresholding rules. Our optimal procedures demand extensive eigenvalue shrinkage and offer substantial performance benefits over the standard empirical covariance estimator.
Practitioners may ask whether to view their data as arising within (and apply the procedures of) the proportional or disproportional frameworks. Conveniently, it is possible to remain {\it framework agnostic}: one unified set of closed-form shrinkage rules (depending only on the aspect ratio $γ_n$ of the given data) offers full asymptotic optimality under either framework. At the heart of the phenomena we explore is the spiked Wigner model, in which a low-rank matrix is perturbed by symmetric noise. Exploiting a connection to the spiked covariance model as $γ_n \rightarrow 0$, we derive optimal eigenvalue shrinkage rules for estimation of the low-rank component, of independent and fundamental interest.
Submitted 30 July, 2023; v1 submitted 10 October, 2022;
originally announced October 2022.
-
Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path
Authors:
X. Y. Han,
Vardan Papyan,
David L. Donoho
Abstract:
The recently discovered Neural Collapse (NC) phenomenon occurs pervasively in today's deep net training paradigm of driving cross-entropy (CE) loss towards zero. During NC, last-layer features collapse to their class-means, both classifiers and class-means collapse to the same Simplex Equiangular Tight Frame, and classifier behavior collapses to the nearest-class-mean decision rule. Recent works demonstrated that deep nets trained with mean squared error (MSE) loss perform comparably to those trained with CE. As a preliminary, we empirically establish that NC emerges in such MSE-trained deep nets as well through experiments on three canonical networks and five benchmark datasets. We provide, in a Google Colab notebook, PyTorch code for reproducing MSE-NC and CE-NC: at https://colab.research.google.com/github/neuralcollapse/neuralcollapse/blob/main/neuralcollapse.ipynb. The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.
Submitted 9 May, 2022; v1 submitted 3 June, 2021;
originally announced June 2021.
-
The Impossibility Region for Detecting Sparse Mixtures using the Higher Criticism
Authors:
David L. Donoho,
Alon Kipnis
Abstract:
Consider a multiple hypothesis testing setting involving rare/weak effects: relatively few tests, out of possibly many, deviate from their null hypothesis behavior. Summarizing the significance of each test by a P-value, we construct a global test against the null using the Higher Criticism (HC) statistics of these P-values. We calibrate the rare/weak model using parameters controlling the asymptotic distribution of non-null P-values near zero. We derive a region in the parameter space where the HC test is asymptotically powerless. Our derivation involves very different tools than previously used to show the powerlessness of HC, relying on properties of the empirical processes underlying HC. In particular, our result applies to situations where HC is not asymptotically optimal, or when the asymptotically detectable region of the parameter space is unknown.
Submitted 19 October, 2021; v1 submitted 15 February, 2021;
originally announced March 2021.
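The Higher Criticism statistic referred to in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used form of HC, the maximum of the standardized gaps between the sorted P-values and their uniform expectations over the smallest fraction `gamma0` of P-values. The function name, the default `gamma0 = 0.5`, and the handling of degenerate P-values are illustrative choices and are not taken from the paper.

```python
import math


def higher_criticism(p_values, gamma0=0.5):
    """Higher Criticism statistic of a list of P-values (illustrative sketch).

    Computes the maximum over i <= gamma0 * n of
        sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))),
    where p_(1) <= ... <= p_(n) are the sorted P-values.
    """
    p_sorted = sorted(p_values)
    n = len(p_sorted)
    best = -math.inf
    for i, p in enumerate(p_sorted, start=1):
        if i > gamma0 * n:
            break
        if p <= 0.0 or p >= 1.0:
            # Assumption of this sketch: skip degenerate P-values so the
            # denominator below is never zero.
            continue
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


# One unusually small P-value among four drives the statistic up.
hc = higher_criticism([0.01, 0.5, 0.6, 0.9])
```

A global test would reject the null when this value exceeds a calibrated threshold; the threshold itself is not specified here.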
-
ScreeNOT: Exact MSE-Optimal Singular Value Thresholding in Correlated Noise
Authors:
David L. Donoho,
Matan Gavish,
Elad Romanov
Abstract:
We derive a formula for optimal hard thresholding of the singular value decomposition in the presence of correlated additive noise; although it nominally involves unobservables, we show how to apply it even where the noise covariance structure is not a-priori known or is not independently estimable.
The proposed method, which we call ScreeNOT, is a mathematically solid alternative to Cattell's ever-popular but vague Scree Plot heuristic from 1966.
ScreeNOT has a surprising oracle property: it typically achieves exactly, in large finite samples, the lowest possible MSE for matrix recovery, on each given problem instance - i.e. the specific threshold it selects gives exactly the smallest achievable MSE loss among all possible threshold choices for that noisy dataset and that unknown underlying true low rank model. The method is computationally efficient and robust against perturbations of the underlying covariance structure.
Our results depend on the assumption that the singular values of the noise have a limiting empirical distribution of compact support; this model, which is standard in random matrix theory, is satisfied by many models exhibiting either cross-row correlation structure or cross-column correlation structure, and also by many situations where there is inter-element correlation structure. Simulations demonstrate the effectiveness of the method even at moderate matrix sizes. The paper is supplemented by ready-to-use software packages implementing the proposed algorithm: package ScreeNOT in Python (via PyPI) and R (via CRAN).
Submitted 26 March, 2023; v1 submitted 25 September, 2020;
originally announced September 2020.
-
Higher Criticism to Compare Two Large Frequency Tables, with sensitivity to Possible Rare and Weak Differences
Authors:
David L. Donoho,
Alon Kipnis
Abstract:
We adapt Higher Criticism (HC) to the comparison of two frequency tables which may -- or may not -- exhibit moderate differences between the tables in some unknown, relatively small subset out of a large number of categories. Our analysis of the power of the proposed HC test quantifies the rarity and size of assumed differences and applies moderate deviations-analysis to determine the asymptotic powerfulness/powerlessness of our proposed HC procedure.
Our analysis considers the null hypothesis of no difference in underlying generative model against a rare/weak perturbation alternative, in which the frequencies of $N^{1-β}$ out of the $N$ categories are perturbed by $r(\log N)/2n$ in the Hellinger distance; here $n$ is the size of each sample. Our proposed Higher Criticism (HC) test for this setting uses P-values obtained from $N$ exact binomial tests. We characterize the asymptotic performance of the HC-based test in terms of the rarity parameter $β$ and the perturbation intensity parameter $r$. Specifically, we derive a region in the $(β,r)$-plane where the test asymptotically has maximal power, while having asymptotically no power outside this region. Our analysis distinguishes between cases in which the counts in both tables are low, versus cases in which counts are high, corresponding to the cases of sparse and dense frequency tables. The phase transition curve of HC in the high-counts regime matches formally the curve delivered by HC in a two-sample normal means model.
Submitted 21 June, 2022; v1 submitted 3 July, 2020;
originally announced July 2020.
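The per-category P-values mentioned in the abstract above can be sketched as follows. This is a hedged illustration that assumes the two samples have equal size, so that under the null a category's count in the first table, given the combined count, is Binomial(total, 1/2). The function names are hypothetical and the paper's exact test construction may differ.

```python
import math


def binomial_p_value(x, total):
    """Two-sided exact P-value for x ~ Binomial(total, 1/2).

    Sums the probability of every outcome at least as far from total/2
    as the observed count x.
    """
    if total == 0:
        return 1.0  # an empty category carries no evidence
    dev = abs(x - total / 2.0)
    tail = sum(
        math.comb(total, k)
        for k in range(total + 1)
        if abs(k - total / 2.0) >= dev
    )
    return tail / (1 << total)  # divide by 2**total


def table_p_values(counts_a, counts_b):
    """One P-value per category from two equally sized frequency tables."""
    return [binomial_p_value(a, a + b) for a, b in zip(counts_a, counts_b)]


# Category 0 is lopsided (0 vs 4); category 1 is perfectly balanced (5 vs 5).
p_vals = table_p_values([0, 5], [4, 5])
```

The resulting list is the kind of input a Higher Criticism statistic would then summarize into a single global test.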
-
Optimal Covariance Estimation for Condition Number Loss in the Spiked Model
Authors:
David L. Donoho,
Behrooz Ghorbani
Abstract:
We study estimation of the covariance matrix under relative condition number loss $κ(Σ^{-1/2} \hatΣ Σ^{-1/2})$, where $κ(Δ)$ is the condition number of matrix $Δ$, and $\hatΣ$ and $Σ$ are the estimated and theoretical covariance matrices. Optimality in $κ$-loss provides optimal guarantees in two stylized applications: Multi-User Covariance Estimation and Multi-Task Linear Discriminant Analysis. We assume the so-called spiked covariance model for $Σ$, and exploit recent advances in understanding that model, to derive a nonlinear shrinker which is asymptotically optimal among orthogonally-equivariant procedures. In our asymptotic study, the number of variables $p$ is comparable to the number of observations $n$. The form of the optimal nonlinearity depends on the aspect ratio $γ=p/n$ of the data matrix and on the top eigenvalue of $Σ$. For $γ> 0.618...$, even dependence on the top eigenvalue can be avoided. The optimal shrinker has two notable properties. First, when $p/n \rightarrow γ\gg 1$ is large, it shrinks even very large eigenvalues substantially, by a factor $1/(1+γ)$. Second, even for moderate $γ$, certain highly statistically significant eigencomponents will be completely suppressed. We show that when $γ\gg 1$ is large, purely diagonal covariance matrices can be optimal, despite the top eigenvalues being large and the empirical eigenvalues being highly statistically significant. This aligns with practitioner experience. We identify intuitively reasonable procedures with small worst-case relative regret - the simplest being generalized soft thresholding having threshold at the bulk edge and slope $(1+γ)^{-1}$ above the bulk. For $γ< 2$ it has at most a few percent relative regret.
Submitted 17 October, 2018;
originally announced October 2018.
-
Variance Breakdown of Huber (M)-estimators: $n/p \rightarrow m \in (1,\infty)$
Authors:
David L. Donoho,
Andrea Montanari
Abstract:
A half century ago, Huber evaluated the minimax asymptotic variance in scalar location estimation, $ \min_ψ\max_{F \in {\cal F}_ε} V(ψ, F) = \frac{1}{I(F_ε^*)} $, where $V(ψ,F)$ denotes the asymptotic variance of the $(M)$-estimator for location with score function $ψ$, and $I(F_ε^*)$ is the minimal Fisher information $ \min_{{\cal F}_ε} I(F)$ over the class of $ε$-Contaminated Normal distributions.
We consider the linear regression model $Y = Xθ_0 + W$, $W_i\sim_{\text{i.i.d.}}F$, and iid Normal predictors $X_{i,j}$, working in the high-dimensional-limit asymptotic where the number $n$ of observations and $p$ of variables both grow large, while $n/p \rightarrow m \in (1,\infty)$; hence $m$ plays the role of `asymptotic number of observations per parameter estimated'. Let $V_m(ψ,F)$ denote the per-coordinate asymptotic variance of the $(M)$-estimator of regression in the $n/p \rightarrow m$ regime. Then $V_m \neq V$; however $V_m \rightarrow V$ as $m \rightarrow \infty$.
In this paper we evaluate the minimax asymptotic variance of the Huber $(M)$-estimate. The statistician minimizes over the family $(ψ_λ)_{λ> 0}$ of all tunings of Huber $(M)$-estimates of regression, and Nature maximizes over gross-error contaminations $F \in {\cal F}_ε$. Suppose that $I(F_ε^*) \cdot m > 1$. Then $ \min_λ\max_{F \in {\cal F}_ε} V_m(ψ_λ, F) = \frac{1}{I(F_ε^*) - 1/m} $. Strikingly, if $I(F_ε^*) \cdot m \leq 1$, then the minimax asymptotic variance is $+\infty$. The breakdown point is where the Fisher information per parameter equals unity.
Submitted 6 March, 2015;
originally announced March 2015.
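The breakdown behavior stated in the abstract above can be made concrete with a small numerical sketch. It evaluates $1/(I(F_ε^*) - 1/m)$ when $I(F_ε^*) \cdot m > 1$ and returns $+\infty$ otherwise. The function name is hypothetical, and the Fisher information value used below is an arbitrary illustrative number, not one derived from a particular contamination level $ε$.

```python
import math


def minimax_variance(fisher_info, m):
    """Minimax asymptotic variance 1 / (I - 1/m), or +inf at breakdown.

    fisher_info stands for I(F_eps^*); m is the limiting ratio n/p.
    """
    if m <= 1:
        raise ValueError("the stated regime requires m in (1, infinity)")
    if fisher_info * m <= 1.0:
        return math.inf  # Fisher information per parameter is at most 1
    return 1.0 / (fisher_info - 1.0 / m)


# Illustrative value only: I(F_eps^*) = 0.5.
v_moderate = minimax_variance(0.5, 4)     # 1 / (0.5 - 0.25)
v_breakdown = minimax_variance(0.5, 2)    # I * m = 1, so the variance is infinite
v_classical = minimax_variance(0.5, 1e9)  # approaches Huber's 1 / I as m grows
```

The last call illustrates the statement that $V_m \rightarrow V$ as $m \rightarrow \infty$: the correction term $1/m$ vanishes and the classical value $1/I(F_ε^*)$ is recovered.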
-
Optimal Shrinkage of Singular Values
Authors:
Matan Gavish,
David L. Donoho
Abstract:
We consider recovery of low-rank matrices from noisy data by shrinkage of singular values, in which a single, univariate nonlinearity is applied to each of the empirical singular values. We adopt an asymptotic framework, in which the matrix size is much larger than the rank of the signal matrix to be recovered, and the signal-to-noise ratio of the low-rank piece stays constant. For a variety of loss functions, including Mean Square Error (MSE - square Frobenius norm), the nuclear norm loss and the operator norm loss, we show that in this framework there is a well-defined asymptotic loss that we evaluate precisely in each case. In fact, each of the loss functions we study admits a unique admissible shrinkage nonlinearity dominating all other nonlinearities. We provide a general method for evaluating these optimal nonlinearities, and demonstrate our framework by working out simple, explicit formulas for the optimal nonlinearities in the Frobenius, nuclear and operator norm cases. For example, for a square low-rank n-by-n matrix observed in white noise with level $σ$, the optimal nonlinearity for MSE loss simply shrinks each data singular value $y$ to $\sqrt{y^2-4nσ^2 }$ (or to 0 if $y<2\sqrt{n}σ$). This optimal nonlinearity guarantees an asymptotic MSE of $2nrσ^2$, which compares favorably with optimally tuned hard thresholding and optimally tuned soft thresholding, providing guarantees of $3nrσ^2$ and $6nrσ^2$, respectively. Our general method also allows one to evaluate optimal shrinkers numerically to arbitrary precision. As an example, we compute optimal shrinkers for the Schatten-p norm loss, for any p>0.
Submitted 15 May, 2016; v1 submitted 29 May, 2014;
originally announced May 2014.
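The MSE-optimal nonlinearity quoted in the abstract above, for a square n-by-n matrix in white noise of level $σ$, can be written as a short sketch. Only the scalar rule is shown; computing the singular value decomposition and reassembling the denoised matrix are left out, and the function names are illustrative.

```python
import math


def shrink_singular_value(y, n, sigma):
    """Shrink one data singular value y to sqrt(y^2 - 4 n sigma^2).

    Values below the cutoff 2 * sqrt(n) * sigma are set to zero.
    """
    cutoff = 2.0 * math.sqrt(n) * sigma
    if y < cutoff:
        return 0.0
    return math.sqrt(y * y - 4.0 * n * sigma * sigma)


def shrink_all(singular_values, n, sigma):
    """Apply the same univariate rule to every empirical singular value."""
    return [shrink_singular_value(y, n, sigma) for y in singular_values]


# With n = 100 and sigma = 1 the cutoff is 20.
shrunk = shrink_all([25.0, 20.0, 19.0], n=100, sigma=1.0)
```

In this example the value 25 is shrunk to 15, while values at or below the cutoff of 20 are mapped to zero.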
-
Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model
Authors:
David L. Donoho,
Matan Gavish,
Iain M. Johnstone
Abstract:
We show that in a common high-dimensional covariance model, the choice of loss function has a profound effect on optimal estimation. In an asymptotic framework based on the Spiked Covariance model and use of orthogonally invariant estimators, we show that optimal estimation of the population covariance matrix boils down to design of an optimal shrinker $η$ that acts elementwise on the sample eigenvalues. Indeed, to each loss function there corresponds a unique admissible eigenvalue shrinker $η^*$ dominating all other shrinkers. The shape of the optimal shrinker is determined by the choice of loss function and, crucially, by inconsistency of both eigenvalues and eigenvectors of the sample covariance matrix. Details of these phenomena and closed-form formulas for the optimal eigenvalue shrinkers are worked out for a menagerie of 26 loss functions for covariance estimation found in the literature, including the Stein, Entropy, Divergence, Fréchet, Bhattacharya/Matusita, Frobenius Norm, Operator Norm, Nuclear Norm and Condition Number losses.
Submitted 4 June, 2017; v1 submitted 4 November, 2013;
originally announced November 2013.
-
The Phase Transition of Matrix Recovery from Gaussian Measurements Matches the Minimax MSE of Matrix Denoising
Authors:
David L. Donoho,
Matan Gavish,
Andrea Montanari
Abstract:
Let $X_0$ be an unknown $M$ by $N$ matrix. In matrix recovery, one takes $n < MN$ linear measurements $y_1, \ldots, y_n$ of $X_0$, where $y_i = \mathrm{Tr}(a_i^T X_0)$ and each $a_i$ is an $M$ by $N$ matrix. For measurement matrices with Gaussian i.i.d. entries, it is known that if $X_0$ is of low rank, it is recoverable from just a few measurements. A popular approach for matrix recovery is Nuclear Norm Minimization (NNM). Empirical work reveals a \emph{phase transition} curve, stated in terms of the undersampling fraction $δ(n,M,N) = n/(MN)$, rank fraction $ρ = r/N$ and aspect ratio $β = M/N$. Specifically, a curve $δ^* = δ^*(ρ;β)$ exists such that, if $δ > δ^*(ρ;β)$, NNM typically succeeds, while if $δ < δ^*(ρ;β)$, it typically fails. An apparently quite different problem is matrix denoising in Gaussian noise, where an unknown $M$ by $N$ matrix $X_0$ is to be estimated based on direct noisy measurements $Y = X_0 + Z$, where the matrix $Z$ has i.i.d. Gaussian entries. It has been empirically observed that, if $X_0$ has low rank, it may be recovered quite accurately from the noisy measurement $Y$. A popular matrix denoising scheme solves the unconstrained optimization problem $\min_X \| Y - X \|_F^2/2 + λ\|X\|_*$. When optimally tuned, this scheme achieves the asymptotic minimax MSE $\mathcal{M}(ρ) = \lim_{N \rightarrow \infty} \inf_λ \sup_{\mathrm{rank}(X) \leq ρ\cdot N} \mathrm{MSE}(X,\hat{X}_λ)$. We report extensive experiments showing that the phase transition $δ^*(ρ)$ in the first problem coincides with the minimax risk curve $\mathcal{M}(ρ)$ in the second problem, for {\em any} rank fraction $0 < ρ < 1$.
Submitted 10 February, 2013;
originally announced February 2013.
-
Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing
Authors:
David L. Donoho,
Adel Javanmard,
Andrea Montanari
Abstract:
We study the compressed sensing reconstruction problem for a broad class of random, band-diagonal sensing matrices. This construction is inspired by the idea of spatial coupling in coding theory. As demonstrated heuristically and numerically by Krzakala et al. \cite{KrzakalaEtAl}, message passing algorithms can effectively solve the reconstruction problem for spatially coupled measurements with undersampling rates close to the fraction of non-zero coordinates.
We use an approximate message passing (AMP) algorithm and analyze it through the state evolution method. We give a rigorous proof that this approach is successful as soon as the undersampling rate $δ$ exceeds the (upper) Rényi information dimension of the signal, $\uRenyi(p_X)$. More precisely, for a sequence of signals of diverging dimension $n$ whose empirical distribution converges to $p_X$, reconstruction is with high probability successful from $\uRenyi(p_X)\, n+o(n)$ measurements taken according to a band diagonal matrix.
For sparse signals, i.e., sequences of dimension $n$ and $k(n)$ non-zero entries, this implies reconstruction from $k(n)+o(n)$ measurements. For `discrete' signals, i.e., signals whose coordinates take a fixed finite set of values, this implies reconstruction from $o(n)$ measurements. The result is robust with respect to noise, does not apply uniquely to random signals, but requires the knowledge of the empirical distribution of the signal $p_X$.
Submitted 18 January, 2013; v1 submitted 3 December, 2011;
originally announced December 2011.
-
Microlocal Analysis of the Geometric Separation Problem
Authors:
David L. Donoho,
Gitta Kutyniok
Abstract:
Image data are often composed of two or more geometrically distinct constituents; in galaxy catalogs, for instance, one sees a mixture of pointlike structures (galaxy superclusters) and curvelike structures (filaments). It would be ideal to process a single image and extract two geometrically `pure' images, each one containing features from only one of the two geometric constituents. This seems to be a seriously underdetermined problem, but recent empirical work achieved highly persuasive separations. We present a theoretical analysis showing that accurate geometric separation of point and curve singularities can be achieved by minimizing the $\ell_1$ norm of the representing coefficients in two geometrically complementary frames: wavelets and curvelets. Driving our analysis is a specific property of the ideal (but unachievable) representation where each content type is expanded in the frame best adapted to it. This ideal representation has the property that important coefficients are clustered geometrically in phase space, and that at fine scales, there is very little coherence between a cluster of elements in one frame expansion and individual elements in the complementary frame. We formally introduce notions of cluster coherence and clustered sparsity and use this machinery to show that the underdetermined systems of linear equations can be stably solved by $\ell_1$ minimization; microlocal phase space helps organize the calculations that cluster coherence requires.
Submitted 18 April, 2010;
originally announced April 2010.
-
The Noise-Sensitivity Phase Transition in Compressed Sensing
Authors:
David L. Donoho,
Arian Maleki,
Andrea Montanari
Abstract:
Consider the noisy underdetermined system of linear equations: y=Ax0 + z0, with n x N measurement matrix A, n < N, and Gaussian white noise z0 ~ N(0,σ^2 I). Both y and A are known, both x0 and z0 are unknown, and we seek an approximation to x0. When x0 has few nonzeros, useful approximations are obtained by l1-penalized l2 minimization, in which the reconstruction \hxl solves min || y - Ax||^2/2 + λ||x||_1.
Evaluate performance by mean-squared error (MSE = E ||\hxl - x0||_2^2/N). Consider matrices A with iid Gaussian entries and a large-system limit in which n,N \to \infty with n/N \to δ and k/n \to ρ. Call the ratio MSE/σ^2 the noise sensitivity. We develop formal expressions for the MSE of \hxl, and evaluate its worst-case formal noise sensitivity over all types of k-sparse signals. The phase space 0 < δ, ρ < 1 is partitioned by the curve ρ = \rhoMSE(δ) into two regions. Formal noise sensitivity is bounded throughout the region ρ < \rhoMSE(δ) and is unbounded throughout the region ρ > \rhoMSE(δ). The phase boundary ρ = \rhoMSE(δ) is identical to the previously-known phase transition curve for equivalence of l1 - l0 minimization in the k-sparse noiseless case. Hence a single phase boundary describes the fundamental phase transitions both for the noiseless and noisy cases. Extensive computational experiments validate the predictions of this formalism, including the existence of game theoretical structures underlying it. Underlying our formalism is the AMP algorithm introduced earlier by the authors. Other papers by the authors detail expressions for the formal MSE of AMP and its close connection to l1-penalized reconstruction. Here we derive the minimax formal MSE of AMP and then read out results for l1-penalized reconstruction.
Submitted 7 April, 2010;
originally announced April 2010.
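The l1-penalized l2 minimization problem in the abstract above, min ||y - Ax||^2/2 + λ||x||_1, can be illustrated with a minimal solver sketch. Note that this uses plain iterative soft thresholding (ISTA), a simple stand-in chosen for brevity, and not the AMP algorithm on which the paper's formalism rests. The function names, the fixed step size, and the iteration count are assumptions of the sketch.

```python
def soft_threshold(v, t):
    """Scalar soft thresholding: move v toward zero by t, clipping at zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(A, y, lam, step, iterations=500):
    """Approximately minimize ||y - A x||^2 / 2 + lam * ||x||_1 by ISTA.

    A is a list of n rows, each of length N. For convergence, step should
    be at most 1 / (largest eigenvalue of A^T A); that is not checked here.
    """
    n, N = len(A), len(A[0])
    x = [0.0] * N
    for _ in range(iterations):
        # Residual y - A x, then gradient of the quadratic term, -A^T (y - A x).
        residual = [y[i] - sum(A[i][j] * x[j] for j in range(N)) for i in range(n)]
        gradient = [-sum(A[i][j] * residual[i] for i in range(n)) for j in range(N)]
        # Gradient step followed by soft thresholding at level step * lam.
        x = [soft_threshold(x[j] - step * gradient[j], step * lam) for j in range(N)]
    return x


# Sanity check with A = identity, where the minimizer is soft_threshold(y, lam).
x_hat = ista([[1.0, 0.0], [0.0, 1.0]], [3.0, 0.5], lam=1.0, step=1.0)
```

The identity-matrix case is used only because its solution is known in closed form; the underdetermined setting n < N of the abstract uses the same iteration with a wide matrix A.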
-
Optimally Tuned Iterative Reconstruction Algorithms for Compressed Sensing
Authors:
Arian Maleki,
David L. Donoho
Abstract:
We conducted an extensive computational experiment, lasting multiple CPU-years, to optimally select parameters for two important classes of algorithms for finding sparse solutions of underdetermined systems of linear equations. We make the optimally tuned implementations available at {\tt sparselab.stanford.edu}; they run `out of the box' with no user tuning: it is not necessary to select thresholds or know the likely degree of sparsity. Our class of algorithms includes iterative hard and soft thresholding with or without relaxation, as well as CoSaMP, subspace pursuit and some natural extensions. As a result, our optimally tuned algorithms dominate such proposals. Our notion of optimality is defined in terms of phase transitions, i.e. we maximize the number of nonzeros at which the algorithm can successfully operate. We show that the phase transition is a well-defined quantity with our suite of random underdetermined linear systems. Our tuning gives the highest transition possible within each class of algorithms.
Submitted 3 September, 2009;
originally announced September 2009.
-
Observed Universality of Phase Transitions in High-Dimensional Geometry, with Implications for Modern Data Analysis and Signal Processing
Authors:
David L. Donoho,
Jared Tanner
Abstract:
We review connections between phase transitions in high-dimensional combinatorial geometry and phase transitions occurring in modern high-dimensional data analysis and signal processing. In data analysis, such transitions arise as abrupt breakdown of linear model selection, robust data fitting or compressed sensing reconstructions, when the complexity of the model or the number of outliers increases beyond a threshold. In combinatorial geometry these transitions appear as abrupt changes in the properties of face counts of convex polytopes when the dimensions are varied. The thresholds in these very different problems appear in the same critical locations after appropriate calibration of variables.
These thresholds are important in each subject area: for linear modelling, they place hard limits on the degree to which the now-ubiquitous high-throughput data analysis can be successful; for robustness, they place hard limits on the degree to which standard robust fitting methods can tolerate outliers before breaking down; for compressed sensing, they define the sharp boundary of the undersampling/sparsity tradeoff in undersampling theorems.
Existing derivations of phase transitions in combinatorial geometry assume the underlying matrices have independent and identically distributed (iid) Gaussian elements. In applications, however, it often seems that Gaussianity is not required. We conducted an extensive computational experiment and formal inferential analysis to test the hypothesis that these phase transitions are {\it universal} across a range of underlying matrix ensembles. The experimental results are consistent with an asymptotic large-$n$ universality across matrix ensembles; finite-sample universality can be rejected.
Submitted 14 June, 2009;
originally announced June 2009.
-
Counting the Faces of Randomly-Projected Hypercubes and Orthants, with Applications
Authors:
David L. Donoho,
Jared Tanner
Abstract:
Let $A$ be an $n$ by $N$ real valued random matrix, and $\h$ denote the $N$-dimensional hypercube. For numerous random matrix ensembles, the expected number of $k$-dimensional faces of the random $n$-dimensional zonotope $A\h$ obeys the formula $E f_k(A\h) /f_k(\h) = 1-P_{N-n,N-k}$, where $P_{N-n,N-k}$ is a fair-coin-tossing probability. The formula applies, for example, where the columns of $A$ are drawn i.i.d. from an absolutely continuous symmetric distribution. The formula exploits Wendel's Theorem\cite{We62}.
Let $\po$ denote the positive orthant; the expected number of $k$-faces of the random cone $A \po$ obeys ${\cal E} f_k(A\po) /f_k(\po) = 1 - P_{N-n,N-k}$. The formula applies to numerous matrix ensembles, including those with iid random columns from an absolutely continuous, centrally symmetric distribution. There is an asymptotically sharp threshold in the behavior of face counts of the projected hypercube; thresholds known for projecting the simplex and the cross-polytope occur at very different locations. We briefly consider face counts of the projected orthant when $A$ does not have mean zero; these do behave similarly to those for the projected simplex. We consider non-random projectors of the orthant; the 'best possible' $A$ is the one associated with the first $n$ rows of the Fourier matrix.
These geometric face-counting results have implications for signal processing, information theory, inverse problems, and optimization. Most of these flow in some way from the fact that face counting is related to conditions for uniqueness of solutions of underdetermined systems of linear equations.
Submitted 22 July, 2008;
originally announced July 2008.
-
Does median filtering truly preserve edges better than linear filtering?
Authors:
Ery Arias-Castro,
David L. Donoho
Abstract:
Image processing researchers commonly assert that "median filtering is better than linear filtering for removing noise in the presence of edges." Using a straightforward large-$n$ decision-theory framework, this folk-theorem is seen to be false in general. We show that median filtering and linear filtering have similar asymptotic worst-case mean-squared error (MSE) when the signal-to-noise ratio (SNR) is of order 1, which corresponds to the case of constant per-pixel noise level in a digital signal. To see dramatic benefits of median smoothing in an asymptotic setting, the per-pixel noise level should tend to zero (i.e., SNR should grow very large). We show that a two-stage median filtering using two very different window widths can dramatically outperform traditional linear and median filtering in settings where the underlying object has edges. In this two-stage procedure, the first pass, at a fine scale, aims at increasing the SNR. The second pass, at a coarser scale, correctly exploits the nonlinearity of the median. Image processing methods based on nonlinear partial differential equations (PDEs) are often said to improve on linear filtering in the presence of edges. Such methods seem difficult to analyze rigorously in a decision-theoretic framework. A popular example is mean curvature motion (MCM), which is formally a kind of iterated median filtering. Our results on iterated median filtering suggest that some PDE-based methods are candidates to rigorously outperform linear filtering in an asymptotic framework.
Submitted 20 April, 2009; v1 submitted 14 December, 2006;
originally announced December 2006.
-
Counting faces of randomly-projected polytopes when the projection radically lowers dimension
Authors:
David L. Donoho,
Jared Tanner
Abstract:
This paper develops asymptotic methods to count faces of random high-dimensional polytopes. Beyond its intrinsic interest, our conclusions have surprising implications - in statistics, probability, information theory, and signal processing - with potential impacts in practical subjects like medical imaging and digital communications. Three such implications concern: convex hulls of Gaussian point clouds, signal recovery from random projections, and how many gross errors can be efficiently corrected from Gaussian error correcting codes.
Submitted 26 September, 2006; v1 submitted 15 July, 2006;
originally announced July 2006.
-
Adaptive multiscale detection of filamentary structures in a background of uniform random points
Authors:
Ery Arias-Castro,
David L. Donoho,
Xiaoming Huo
Abstract:
We are given a set of $n$ points that might be uniformly distributed in the unit square $[0,1]^2$. We wish to test whether the set, although mostly consisting of uniformly scattered points, also contains a small fraction of points sampled from some (a priori unknown) curve with $C^α$-norm bounded by $β$. An asymptotic detection threshold exists in this problem; for a constant $T_-(α,β)>0$, if the number of points sampled from the curve is smaller than $T_-(α,β)n^{1/(1+α)}$, reliable detection is not possible for large $n$. We describe a multiscale significant-runs algorithm that can reliably detect concentration of data near a smooth curve, without knowing the smoothness information $α$ or $β$ in advance, provided that the number of points on the curve exceeds $T_*(α,β)n^{1/(1+α)}$. This algorithm therefore has an optimal detection threshold, up to a factor $T_*/T_-$. At the heart of our approach is an analysis of the data by counting membership in multiscale multianisotropic strips. The strips will have area $2/n$ and exhibit a variety of lengths, orientations and anisotropies. The strips are partitioned into anisotropy classes; each class is organized as a directed graph whose vertices all are strips of the same anisotropy and whose edges link such strips to their ``good continuations.'' The point-cloud data are reduced to counts that measure membership in strips. Each anisotropy graph is reduced to a subgraph that consists of strips with significant counts. The algorithm rejects $\mathbf{H}_0$ whenever some such subgraph contains a path that connects many consecutive significant counts.
Submitted 18 May, 2006;
originally announced May 2006.
-
Correction. Connect The Dots: How Many Random Points Can A Regular Curve Pass Through?
Authors:
E. Arias-Castro,
D. L. Donoho,
X. Huo,
C. A. Tovey
Abstract:
Correction for Adv. in Appl. Probab. 37, no. 3 (2005), 571-603
Submitted 28 March, 2006;
originally announced March 2006.
-
Adapting to Unknown Sparsity by controlling the False Discovery Rate
Authors:
Felix Abramovich,
Yoav Benjamini,
David L. Donoho,
Iain M. Johnstone
Abstract:
We attempt to recover an $n$-dimensional vector observed in white noise, where $n$ is large and the vector is known to be sparse, but the degree of sparsity is unknown. We consider three different ways of defining sparsity of a vector: using the fraction of nonzero terms; imposing power-law decay bounds on the ordered entries; and controlling the $\ell_p$ norm for $p$ small. We obtain a procedure which is asymptotically minimax for $\ell^r$ loss, simultaneously throughout a range of such sparsity classes.
The optimal procedure is a data-adaptive thresholding scheme, driven by control of the {\it False Discovery Rate} (FDR). FDR control is a relatively recent innovation in simultaneous testing, ensuring that at most a certain fraction of the rejected null hypotheses will correspond to false rejections.
In our treatment, the FDR control parameter $q_n$ also plays a determining role in asymptotic minimaxity. If $q = \lim q_n \in [0,1/2]$ and also $q_n > γ/\log(n)$ we get sharp asymptotic minimaxity, simultaneously, over a wide range of sparse parameter spaces and loss functions. On the other hand, $q = \lim q_n \in (1/2,1]$ forces the risk to exceed the minimax risk by a factor growing with $q$.
To our knowledge, this relation between ideas in simultaneous inference and asymptotic decision theory is new.
Our work provides a new perspective on a class of model selection rules which has been introduced recently by several authors. These new rules impose complexity penalization of the form $2 \cdot \log(\text{potential model size} / \text{actual model size})$. We exhibit a close connection with FDR-controlling procedures under stringent control of the false discovery rate.
Submitted 18 May, 2005;
originally announced May 2005.
-
Emerging applications of geometric multiscale analysis
Authors:
David L. Donoho
Abstract:
Classical multiscale analysis based on wavelets has a number of successful applications, e.g. in data compression, fast algorithms, and noise removal. Wavelets, however, are adapted to point singularities, and many phenomena in several variables exhibit intermediate-dimensional singularities, such as edges, filaments, and sheets. This suggests that in higher dimensions, wavelets ought to be replaced in certain applications by multiscale analysis adapted to intermediate-dimensional singularities.
My lecture described various initial attempts in this direction. In particular, I discussed two approaches to geometric multiscale analysis originally arising in the work of harmonic analysts Hart Smith and Peter Jones (and others): (a) a directional wavelet transform based on parabolic dilations; and (b) analysis via anisotropic strips. Perhaps surprisingly, these tools have potential applications in data compression, inverse problems, noise removal, and signal detection; applied mathematicians, statisticians, and engineers are eagerly pursuing these leads.
Submitted 30 November, 2002;
originally announced December 2002.