Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds

Authors: Yang Guo, Yutian Tao, Yifei Ming, Robert D. Nowak, Yingyu Liang

Abstract: Retrieval-augmented generation (RAG) has seen many empirical successes in recent years by aiding the LLM with external knowledge. However, its theoretical aspect has remained mostly unexplored. In this paper, we propose the first finite-sample generalization bound for RAG in in-context linear regression and derive an exact bias-variance tradeoff. Our framework views the retrieved texts as query-dependent noisy in-context examples and recovers the classical in-context learning (ICL) and standard RAG as the limit cases. Our analysis suggests that an intrinsic ceiling on generalization error exists on RAG as opposed to the ICL. Furthermore, our framework is able to model retrieval both from the training data and from external corpora by introducing uniform and non-uniform RAG noise. In line with our theory, we show the sample efficiency of ICL and RAG empirically with experiments on common QA benchmarks, such as Natural Questions and TriviaQA.

Submitted 9 June, 2025; v1 submitted 3 June, 2025; originally announced June 2025.

Comments: Under Review

arXiv:2502.17671 [pdf, ps, other]

Optimal Recovery Meets Minimax Estimation

Authors: Ronald DeVore, Robert D. Nowak, Rahul Parhi, Guergana Petrova, Jonathan W. Siegel

Abstract: A fundamental problem in statistics and machine learning is to estimate a function $f$ from possibly noisy observations of its point samples. The goal is to design a numerical algorithm to construct an approximation $\hat f$ to $f$ in a prescribed norm that asymptotically achieves the best possible error (as a function of the number $m$ of observations and the variance $σ^2$ of the noise). This problem has received considerable attention in both nonparametric statistics (noisy observations) and optimal recovery (noiseless observations). Quantitative bounds require assumptions on $f$, known as model class assumptions. Classical results assume that $f$ is in the unit ball of a Besov space. In nonparametric statistics, the best possible performance of an algorithm for finding $\hat f$ is known as the minimax rate and has been studied in this setting under the assumption that the noise is Gaussian. In optimal recovery, the best possible performance of an algorithm is known as the optimal recovery rate and has also been determined in this setting. While one would expect that the minimax rate recovers the optimal recovery rate when the noise level $σ$ tends to zero, it turns out that the current results on minimax rates do not carefully determine the dependence on $σ$ and the limit cannot be taken. This paper handles this issue and determines the noise-level-aware (NLA) minimax rates for Besov classes when error is measured in an $L_q$-norm with matching upper and lower bounds. The end result is a reconciliation between minimax rates and optimal recovery rates. The NLA minimax rate continuously depends on the noise level and recovers the optimal recovery rate when $σ$ tends to zero.

Submitted 28 May, 2025; v1 submitted 24 February, 2025; originally announced February 2025.

arXiv:2309.01753 [pdf, other]

On Penalty Methods for Nonconvex Bilevel Optimization and First-Order Stochastic Approximation

Authors: Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, Robert Nowak

Abstract: In this work, we study first-order algorithms for solving Bilevel Optimization (BO) where the objective functions are smooth but possibly nonconvex in both levels and the variables are restricted to closed convex sets. As a first step, we study the landscape of BO through the lens of penalty methods, in which the upper- and lower-level objectives are combined in a weighted sum with penalty parameter $σ > 0$. In particular, we establish a strong connection between the penalty function and the hyper-objective by explicitly characterizing the conditions under which the values and derivatives of the two must be $O(σ)$-close. A by-product of our analysis is the explicit formula for the gradient of hyper-objective when the lower-level problem has multiple solutions under minimal conditions, which could be of independent interest. Next, viewing the penalty formulation as $O(σ)$-approximation of the original BO, we propose first-order algorithms that find an $ε$-stationary solution by optimizing the penalty formulation with $σ = O(ε)$. When the perturbed lower-level problem uniformly satisfies the small-error proximal error-bound (EB) condition, we propose a first-order algorithm that converges to an $ε$-stationary point of the penalty function, using in total $O(ε^{-3})$ and $O(ε^{-7})$ accesses to first-order (stochastic) gradient oracles when the oracle is deterministic and oracles are noisy, respectively. Under an additional assumption on stochastic oracles, we show that the algorithm can be implemented in a fully single-loop manner, i.e., with $O(1)$ samples per iteration, and achieves the improved oracle-complexity of $O(ε^{-3})$ and $O(ε^{-5})$, respectively.

Submitted 11 February, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

Comments: ICLR 2024

arXiv:2307.15772 [pdf, ps, other]

doi 10.1016/j.acha.2024.101713

Weighted variation spaces and approximation by shallow ReLU networks

Authors: Ronald DeVore, Robert D. Nowak, Rahul Parhi, Jonathan W. Siegel

Abstract: We investigate the approximation of functions $f$ on a bounded domain $Ω\subset \mathbb{R}^d$ by the outputs of single-hidden-layer ReLU neural networks of width $n$. This form of nonlinear $n$-term dictionary approximation has been intensely studied since it is the simplest case of neural network approximation (NNA). There are several celebrated approximation results for this form of NNA that introduce novel model classes of functions on $Ω$ whose approximation rates do not grow unbounded with the input dimension. These novel classes include Barron classes, and classes based on sparsity or variation such as the Radon-domain BV classes. The present paper is concerned with the definition of these novel model classes on domains $Ω$. The current definition of these model classes does not depend on the domain $Ω$. A new and more proper definition of model classes on domains is given by introducing the concept of weighted variation spaces. These new model classes are intrinsic to the domain itself. The importance of these new model classes is that they are strictly larger than the classical (domain-independent) classes. Yet, it is shown that they maintain the same NNA rates.

Submitted 13 October, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

Journal ref: Applied and Computational Harmonic Analysis, vol. 74, no. 101713, pp. 1-22, 2025

arXiv:2301.10945 [pdf, other]

A Fully First-Order Method for Stochastic Bilevel Optimization

Authors: Jeongyeol Kwon, Dohyun Kwon, Stephen Wright, Robert Nowak

Abstract: We consider stochastic unconstrained bilevel optimization problems when only the first-order gradient oracles are available. While numerous optimization methods have been proposed for tackling bilevel problems, existing methods either tend to require possibly expensive calculations regarding Hessians of lower-level objectives, or lack rigorous finite-time performance guarantees. In this work, we propose a Fully First-order Stochastic Approximation (F2SA) method, and study its non-asymptotic convergence properties. Specifically, we show that F2SA converges to an $ε$-stationary solution of the bilevel problem after $ε^{-7/2}, ε^{-5/2}$, and $ε^{-3/2}$ iterations (each iteration using $O(1)$ samples) when stochastic noises are in both level objectives, only in the upper-level objective, and not present (deterministic settings), respectively. We further show that if we employ momentum-assisted gradient estimators, the iteration complexities can be improved to $ε^{-5/2}, ε^{-4/2}$, and $ε^{-3/2}$, respectively. We demonstrate even superior practical performance of the proposed method over existing second-order based approaches on MNIST data-hypercleaning experiments.

Submitted 26 January, 2023; originally announced January 2023.

arXiv:2109.08844 [pdf, other]

doi 10.1109/TIT.2022.3208653

Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks

Authors: Rahul Parhi, Robert D. Nowak

Abstract: We study the problem of estimating an unknown function from noisy data using shallow ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the second-order Radon-domain bounded variation space. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. This minimax rate is immune to the curse of dimensionality. We quantify an explicit gap between neural networks and linear methods (which include kernel methods) by deriving a linear minimax lower bound for the estimation problem, showing that linear methods necessarily suffer the curse of dimensionality in this function space. As a result, this paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.

Submitted 12 October, 2022; v1 submitted 18 September, 2021; originally announced September 2021.

Comments: IEEE Transactions on Information Theory (in press)

Journal ref: IEEE Transactions on Information Theory, vol. 69, no. 2, pp. 1125-1140, Feb. 2023

arXiv:2002.01044 [pdf, other]

Optimal Confidence Regions for the Multinomial Parameter

Authors: Matthew L. Malloy, Ardhendu Tripathy, Robert D. Nowak

Abstract: Construction of tight confidence regions and intervals is central to statistical inference and decision making. This paper develops new theory showing minimum average volume confidence regions for categorical data. More precisely, consider an empirical distribution $\widehat{\boldsymbol{p}}$ generated from $n$ iid realizations of a random variable that takes one of $k$ possible values according to an unknown distribution $\boldsymbol{p}$. This is analogous to a single draw from a multinomial distribution. A confidence region is a subset of the probability simplex that depends on $\widehat{\boldsymbol{p}}$ and contains the unknown $\boldsymbol{p}$ with a specified confidence. This paper shows how one can construct minimum average volume confidence regions, answering a long standing question. We also show the optimality of the regions directly translates to optimal confidence intervals of linear functionals such as the mean, implying sample complexity and regret improvements for adaptive machine learning algorithms.

Submitted 29 January, 2021; v1 submitted 3 February, 2020; originally announced February 2020.

arXiv:1912.03528 [pdf, other]

Tighter Confidence Intervals for Rating Systems

Authors: Robert Nowak, Ervin Tánczos

Abstract: Rating systems are ubiquitous, with applications ranging from product recommendation to teaching evaluations. Confidence intervals for functionals of rating data such as empirical means or quantiles are critical to decision-making in various applications including recommendation/ranking algorithms. Confidence intervals derived from standard Hoeffding and Bernstein bounds can be quite loose, especially in small sample regimes, since these bounds do not exploit the geometric structure of the probability simplex. We propose a new approach to deriving confidence intervals that are tailored to the geometry associated with multi-star/value rating systems using a combination of techniques from information theory, including Kullback-Leibler, Sanov, and Csiszár inequalities. The new confidence intervals are almost always as good as or better than all standard methods and are significantly tighter in many situations. The standard bounds can require several times more samples than our new bounds to achieve specified confidence interval widths.

Submitted 7 December, 2019; originally announced December 2019.
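
For reference, the following is a minimal sketch of the standard Hoeffding interval that the abstract above describes as loose in small-sample regimes. It is the baseline, not the paper's proposed construction. The function name `hoeffding_interval`, the 1-to-5 star scale, and the sample ratings are illustrative assumptions.

```python
import math

def hoeffding_interval(ratings, lo=1.0, hi=5.0, delta=0.05):
    """Two-sided Hoeffding confidence interval for the mean of bounded ratings.

    Assumes i.i.d. ratings in [lo, hi]; the interval holds with
    probability at least 1 - delta.
    """
    n = len(ratings)
    mean = sum(ratings) / n
    # Hoeffding: P(|mean - mu| >= t) <= 2 exp(-2 n t^2 / (hi - lo)^2)
    half_width = (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    # Clip to the known range of the ratings.
    return max(lo, mean - half_width), min(hi, mean + half_width)

ratings = [5, 4, 4, 3, 5, 5, 2, 4]  # hypothetical star ratings
low, high = hoeffding_interval(ratings)
print(f"mean in [{low:.2f}, {high:.2f}] with confidence 95%")
```

With only eight ratings the interval spans a large part of the scale, which is the kind of small-sample looseness the abstract refers to.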

arXiv:1809.06522 [pdf, other]

Concentration Inequalities for the Empirical Distribution

Authors: Jay Mardia, Jiantao Jiao, Ervin Tánczos, Robert D. Nowak, Tsachy Weissman

Abstract: We study concentration inequalities for the Kullback-Leibler (KL) divergence between the empirical distribution and the true distribution. Applying a recursion technique, we improve over the method of types bound uniformly in all regimes of sample size $n$ and alphabet size $k$, and the improvement becomes more significant when $k$ is large. We discuss the applications of our results in obtaining tighter concentration inequalities for $L_1$ deviations of the empirical distribution from the true distribution, and the difference between concentration around the expectation or zero. We also obtain asymptotically tight bounds on the variance of the KL divergence between the empirical and true distribution, and demonstrate their quantitatively different behaviors between small and large sample sizes compared to the alphabet size.

Submitted 18 October, 2019; v1 submitted 18 September, 2018; originally announced September 2018.

Comments: Accepted for publication in Information and Inference
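
To make the quantities in the abstract above concrete, here is a small sketch that computes the KL divergence between an empirical and a true distribution and evaluates the classical method-of-types bound that the paper improves on. It does not implement the paper's recursion technique. The function names, the sample data, and the uniform reference distribution are illustrative assumptions.

```python
import math
from collections import Counter

def empirical_distribution(samples, k):
    """Relative frequencies of the symbols 0..k-1 in the samples."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(i, 0) / n for i in range(k)]

def kl_divergence(p_hat, p):
    """KL(p_hat || p) in nats, with the convention 0 * log(0 / q) = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p_hat, p) if a > 0)

def method_of_types_bound(n, k, eps):
    """Classical bound: P(KL(p_hat || p) >= eps) <= C(n+k-1, k-1) * exp(-n * eps)."""
    return min(1.0, math.comb(n + k - 1, k - 1) * math.exp(-n * eps))

samples = [0, 0, 1, 2, 2, 2, 1, 0, 2, 2]  # hypothetical draws, k = 3
p_true = [1 / 3, 1 / 3, 1 / 3]            # assumed true distribution
p_hat = empirical_distribution(samples, k=3)
print(kl_divergence(p_hat, p_true))
print(method_of_types_bound(n=1000, k=3, eps=0.05))
```

The binomial coefficient counts the possible empirical distributions (types), which is why the bound weakens as the alphabet size $k$ grows.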

arXiv:1709.03570 [pdf, other]

A KL-LUCB Bandit Algorithm for Large-Scale Crowdsourcing

Authors: Bob Mankoff, Robert Nowak, Ervin Tanczos

Abstract: This paper focuses on best-arm identification in multi-armed bandits with bounded rewards. We develop an algorithm that is a fusion of lil-UCB and KL-LUCB, offering the best qualities of the two algorithms in one method. This is achieved by proving a novel anytime confidence bound for the mean of bounded distributions, which is the analogue of the LIL-type bounds recently developed for sub-Gaussian distributions. We corroborate our theoretical results with numerical experiments based on the New Yorker Cartoon Caption Contest.

Submitted 11 September, 2017; originally announced September 2017.

arXiv:1707.04300 [pdf, other]

Coalescent-based species tree estimation: a stochastic Farris transform

Authors: Gautam Dasarathy, Elchanan Mossel, Robert Nowak, Sebastien Roch

Abstract: The reconstruction of a species phylogeny from genomic data faces two significant hurdles: 1) the trees describing the evolution of each individual gene (i.e., the gene trees) may differ from the species phylogeny and 2) the molecular sequences corresponding to each gene often provide limited information about the gene trees themselves. In this paper we consider an approach to species tree reconstruction that addresses both these hurdles. Specifically, we propose an algorithm for phylogeny reconstruction under the multispecies coalescent model with a standard model of site substitution. The multispecies coalescent is commonly used to model gene tree discordance due to incomplete lineage sorting, a well-studied population-genetic effect. In previous work, an information-theoretic trade-off was derived in this context between the number of loci, $m$, needed for an accurate reconstruction and the length of the locus sequences, $k$. It was shown that to reconstruct an internal branch of length $f$, one needs $m$ to be of the order of $1/[f^{2} \sqrt{k}]$. That previous result was obtained under the molecular clock assumption, i.e., under the assumption that mutation rates (as well as population sizes) are constant across the species phylogeny. Here we generalize this result beyond the restrictive molecular clock assumption, and obtain a new reconstruction algorithm that has the same data requirement (up to log factors). Our main contribution is a novel reduction to the molecular clock case under the multispecies coalescent. As a corollary, we also obtain a new identifiability result of independent interest: for any species tree with $n \geq 3$ species, the rooted species tree can be identified from the distribution of its unrooted weighted gene trees even in the absence of a molecular clock.

Submitted 13 July, 2017; originally announced July 2017.

Comments: Submitted. 49 pages

arXiv:1702.07199 [pdf, ps, other]

Convergence acceleration of alternating series

Authors: Rafał Nowak

Abstract: We propose a new simple convergence acceleration method for a wide class of convergent alternating series. It has some common features with Smith's and Ford's modification of Levin's and Weniger's sequence transformations, but its computational and memory cost is lower. We compare all three methods and give some common theoretical results. Numerical examples confirm a similar performance of all of them.

Submitted 29 April, 2018; v1 submitted 23 February, 2017; originally announced February 2017.

MSC Class: 65B10 ACM Class: G.1.0; G.1.10
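
As a generic illustration of what convergence acceleration of an alternating series means, the sketch below applies the classical Aitken delta-squared transform to the partial sums of the alternating harmonic series. This is a textbook technique chosen for brevity; it is not the method proposed in the paper, nor the Levin or Weniger transformations it is compared against. The function names and the choice of series are illustrative assumptions.

```python
import math

def partial_sums(terms):
    """Running sums s_0, s_1, ... of a list of series terms."""
    total, sums = 0.0, []
    for t in terms:
        total += t
        sums.append(total)
    return sums

def aitken(s):
    """Aitken's delta-squared transform of a sequence of partial sums."""
    out = []
    for i in range(len(s) - 2):
        d1 = s[i + 1] - s[i]
        d2 = s[i + 2] - 2.0 * s[i + 1] + s[i]
        # Fall back to the latest partial sum if the second difference vanishes.
        out.append(s[i + 2] if d2 == 0 else s[i] - d1 * d1 / d2)
    return out

# Alternating harmonic series: sum_{k>=0} (-1)^k / (k + 1) = ln 2
terms = [(-1) ** k / (k + 1) for k in range(12)]
s = partial_sums(terms)
accelerated = aitken(s)
print(abs(s[-1] - math.log(2)), abs(accelerated[-1] - math.log(2)))
```

The two printed numbers are the errors of the last plain partial sum and of the last accelerated value, computed from the same twelve terms.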

arXiv:1602.08895 [pdf, ps, other]

New properties of a certain method of summation of generalized hypergeometric series

Authors: Rafał Nowak, Paweł Woźny

Abstract: In a recent paper (Appl. Math. Comput. 215, 1622-1645, 2009), the authors proposed a method of summation of some slowly convergent series. The purpose of this note is to give more theoretical analysis for this transformation, including the convergence acceleration theorem in the case of summation of generalized hypergeometric series. Some new theoretical results and illustrative numerical examples are given.

Submitted 5 September, 2016; v1 submitted 29 February, 2016; originally announced February 2016.

Comments: revised version

MSC Class: 65B10; 33F05 ACM Class: G.1.0; G.1.2; G.1.10

arXiv:1503.02596 [pdf, ps, other]

doi 10.1109/JSTSP.2016.2537145

A Characterization of Deterministic Sampling Patterns for Low-Rank Matrix Completion

Authors: Daniel L. Pimentel-Alarcón, Nigel Boston, Robert D. Nowak

Abstract: Low-rank matrix completion (LRMC) problems arise in a wide variety of applications. Previous theory mainly provides conditions for completion under missing-at-random samplings. This paper studies deterministic conditions for completion. An incomplete $d \times N$ matrix is finitely rank-$r$ completable if there are at most finitely many rank-$r$ matrices that agree with all its observed entries. Finite completability is the tipping point in LRMC, as a few additional samples of a finitely completable matrix guarantee its unique completability. The main contribution of this paper is a deterministic sampling condition for finite completability. We use this to also derive deterministic sampling conditions for unique completability that can be efficiently verified. We also show that under uniform random sampling schemes, these conditions are satisfied with high probability if $O(\max\{r,\log d\})$ entries per column are observed. These findings have several implications on LRMC regarding lower bounds, sample and computational complexity, the role of coherence, adaptive settings and the validation of any completion algorithm. We complement our theoretical results with experiments that support our findings and motivate future analysis of uncharted sampling regimes.

Submitted 11 October, 2016; v1 submitted 9 March, 2015; originally announced March 2015.

Comments: This update corrects an error in version 2 of this paper, where we erroneously assumed that columns with more than r+1 observed entries would yield multiple independent constraints

Journal ref: IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 4, pp. 623-636, June, 2016

arXiv:1410.0633 [pdf, ps, other]

Deterministic Conditions for Subspace Identifiability from Incomplete Sampling

Authors: Daniel L. Pimentel-Alarcón, Robert D. Nowak, Nigel Boston

Abstract: Consider a generic $r$-dimensional subspace of $\mathbb{R}^d$, $r<d$, and suppose that we are only given projections of this subspace onto small subsets of the canonical coordinates. The paper establishes necessary and sufficient deterministic conditions on the subsets for subspace identifiability.

Submitted 24 May, 2015; v1 submitted 2 October, 2014; originally announced October 2014.

Comments: To appear in Proc. of IEEE ISIT, 2015

arXiv:1404.7055 [pdf, other]

doi 10.1109/TCBB.2014.2361685

Data Requirement for Phylogenetic Inference from Multiple Loci: A New Distance Method

Authors: Gautam Dasarathy, Robert Nowak, Sebastien Roch

Abstract: We consider the problem of estimating the evolutionary history of a set of species (phylogeny or species tree) from several genes. It is known that the evolutionary history of individual genes (gene trees) might be topologically distinct from each other and from the underlying species tree, possibly confounding phylogenetic analysis. A further complication in practice is that one has to estimate gene trees from molecular sequences of finite length. We provide the first full data-requirement analysis of a species tree reconstruction method that takes into account estimation errors at the gene level. Under that criterion, we also devise a novel reconstruction algorithm that provably improves over all previous methods in a regime of interest.

Submitted 30 June, 2014; v1 submitted 28 April, 2014; originally announced April 2014.

Comments: 19 pages, 2 figures. Preliminary version to appear in IEEE ISIT 2014. Added acknowledgements and made the proof of the "equality" part of Theorem 3 explicit in Appendix C

arXiv:1404.3418 [pdf, ps, other]

Active Learning for Undirected Graphical Model Selection

Authors: Divyanshu Vats, Robert D. Nowak, Richard G. Baraniuk

Abstract: This paper studies graphical model selection, i.e., the problem of estimating a graph of statistical relationships among a collection of random variables. Conventional graphical model selection algorithms are passive, i.e., they require all the measurements to have been collected before processing begins. We propose an active learning algorithm that uses junction tree representations to adapt future measurements based on the information gathered from prior measurements. We prove that, under certain conditions, our active learning algorithm requires fewer scalar measurements than any passive algorithm to reliably estimate a graph. A range of numerical results validates our theory and demonstrates the benefits of active learning.

Submitted 13 April, 2014; originally announced April 2014.

Comments: AISTATS 2014

Journal ref: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS) 2014, Reykjavik, Iceland. JMLR: W&CP volume 33

arXiv:1303.6544 [pdf, other]

Sketching Sparse Matrices

Authors: Gautam Dasarathy, Parikshit Shah, Badri Narayan Bhaskar, Robert Nowak

Abstract: This paper considers the problem of recovering an unknown sparse $p \times p$ matrix $X$ from an $m \times m$ matrix $Y = AXB^T$, where $A$ and $B$ are known $m \times p$ matrices with $m \ll p$. The main result shows that there exist constructions of the "sketching" matrices $A$ and $B$ so that even if $X$ has $O(p)$ non-zeros, it can be recovered exactly and efficiently using a convex program as long as these non-zeros are not concentrated in any single row/column of $X$. Furthermore, it suffices for the size of $Y$ (the sketch dimension) to scale as $m = O(\sqrt{\#\text{ nonzeros in } X} \times \log p)$. The results also show that the recovery is robust and stable in the sense that if $X$ is equal to a sparse matrix plus a perturbation, then the convex program we propose produces an approximation with accuracy proportional to the size of the perturbation. Unlike traditional results on sparse recovery, where the sensing matrix produces independent measurements, our sensing operator is highly constrained (it assumes a tensor product structure). Therefore, proving recovery guarantees requires non-standard techniques. Indeed our approach relies on a novel result concerning tensor products of bipartite graphs, which may be of independent interest. This problem is motivated by the following application, among others. Consider a $p \times n$ data matrix $D$, consisting of $n$ observations of $p$ variables. Assume that the correlation matrix $X := DD^{T}$ is (approximately) sparse in the sense that each of the $p$ variables is significantly correlated with only a few others. Our results show that these significant correlations can be detected even if we have access to only a sketch of the data $S = AD$ with $A \in \mathbb{R}^{m \times p}$.

Submitted 26 March, 2013; originally announced March 2013.

arXiv:1209.3079 [pdf, other]

Signal Recovery in Unions of Subspaces with Applications to Compressive Imaging

Authors: Nikhil Rao, Benjamin Recht, Robert Nowak

Abstract: In applications ranging from communications to genetics, signals can be modeled as lying in a union of subspaces. Under this model, signal coefficients that lie in certain subspaces are active or inactive together. The potential subspaces are known in advance, but the particular set of subspaces that are active (i.e., in the signal support) must be learned from measurements. We show that exploiting knowledge of subspaces can further reduce the number of measurements required for exact signal recovery, and derive universal bounds for the number of measurements needed. The bound is universal in the sense that it only depends on the number of subspaces under consideration, and their orientation relative to each other. The particulars of the subspaces (e.g., compositions, dimensions, extents, overlaps, etc.) do not affect the results we obtain. In the process, we derive sample complexity bounds for the special case of the group lasso with overlapping groups (the latent group lasso), which is used in a variety of applications. Finally, we also show that wavelet transform coefficients of images can be modeled as lying in groups, and hence can be efficiently recovered using group lasso methods.

Submitted 13 September, 2012; originally announced September 2012.

Comments: arXiv admin note: substantial text overlap with arXiv:1106.4355

arXiv:1108.3367 [pdf, other]

On the convergence acceleration of some continued fractions

Authors: Rafał Nowak

Abstract: A well-known method for convergence acceleration of a continued fraction $\K(a_n/b_n)$ is to use the modified approximants $S_n(ω_n)$ in place of the classical approximants $S_n(0)$, where $ω_n$ are close to the tails $f^{(n)}$ of the continued fraction. Recently, the author proposed an iterative method producing tail approximations whose asymptotic expansion accuracy improves at each step. This method can be applied to continued fractions $\K(a_n/b_n)$, where $a_n$, $b_n$ are polynomials in $n$ ($\deg a_n=2$, $\deg b_n\leq 1$) for sufficiently large $n$. The purpose of this paper is to extend this idea to the class of continued fractions $\K(a_n/b_n + a_n'/b_n')$, where $a_n$, $a_n'$, $b_n$, $b_n'$ are polynomials in $n$ ($\deg a_n=\deg a_n', \deg b_n=\deg b_n'$). We give examples involving such continued fraction expansions of some mathematical constants, as well as elementary and special functions.

Submitted 4 March, 2012; v1 submitted 16 August, 2011; originally announced August 2011.

Comments: English improved

MSC Class: 65B99; 33F05

ACM Class: G.1.0; G.1.2; G.1.10

arXiv:1105.4540 [pdf, ps, other]

On the Limits of Sequential Testing in High Dimensions

Authors: Matthew Malloy, Robert Nowak

Abstract: This paper presents results pertaining to sequential methods for support recovery of sparse signals in noise. Specifically, we show that any sequential measurement procedure fails provided the average number of measurements per dimension grows slower than log s / D(f0||f1) where s is the level of sparsity, and D(f0||f1) the Kullback-Leibler divergence between the underlying distributions. For comparison, we show any non-sequential procedure fails provided the number of measurements grows at a rate less than log n / D(f1||f0), where n is the total dimension of the problem. Lastly, we show that a simple procedure termed sequential thresholding guarantees exact support recovery provided the average number of measurements per dimension grows faster than (log s + log log n) / D(f0||f1), a mere additive factor more than the lower bound.

Submitted 18 October, 2011; v1 submitted 23 May, 2011; originally announced May 2011.

Comments: Asilomar 2011

arXiv:1103.5991 [pdf, ps, other]

Sequential Analysis in High Dimensional Multiple Testing and Sparse Recovery

Authors: Matthew Malloy, Robert Nowak

Abstract: This paper studies the problem of high-dimensional multiple testing and sparse recovery from the perspective of sequential analysis. In this setting, the probability of error is a function of the dimension of the problem. A simple sequential testing procedure is proposed. We derive necessary conditions for reliable recovery in the non-sequential setting and contrast them with sufficient conditions for reliable recovery using the proposed sequential testing procedure. Applications of the main results to several commonly encountered models show that sequential testing can be exponentially more sensitive to the difference between the null and alternative distributions (in terms of the dependence on dimension), implying that subtle cases can be much more reliably determined using sequential methods.

Submitted 3 June, 2011; v1 submitted 30 March, 2011; originally announced March 2011.

Comments: Submitted to ISIT 2011

arXiv:1006.4046 [pdf, other]

Online Identification and Tracking of Subspaces from Highly Incomplete Information

Authors: Laura Balzano, Robert Nowak, Benjamin Recht

Abstract: This work presents GROUSE (Grassmannian Rank-One Update Subspace Estimation), an efficient online algorithm for tracking subspaces from highly incomplete observations. GROUSE requires only basic linear algebraic manipulations at each iteration, and each subspace update can be performed in linear time in the dimension of the subspace. The algorithm is derived by analyzing incremental gradient descent on the Grassmannian manifold of subspaces. With a slight modification, GROUSE can also be used as an online incremental algorithm for the matrix completion problem of imputing missing entries of a low-rank matrix. GROUSE performs exceptionally well in practice both in tracking subspaces and as an online algorithm for matrix completion.

Submitted 12 July, 2011; v1 submitted 21 June, 2010; originally announced June 2010.

arXiv:1003.0205 [pdf, other]

Detecting Weak but Hierarchically-Structured Patterns in Networks

Authors: Aarti Singh, Robert D. Nowak, Robert Calderbank

Abstract: The ability to detect weak distributed activation patterns in networks is critical to several applications, such as identifying the onset of anomalous activity or incipient congestion in the Internet, or faint traces of a biochemical spread by a sensor network. This is a challenging problem since weak distributed patterns can be invisible in per node statistics as well as a global network-wide aggregate. Most prior work considers situations in which the activation/non-activation of each node is statistically independent, but this is unrealistic in many problems. In this paper, we consider structured patterns arising from statistical dependencies in the activation process. Our contributions are three-fold. First, we propose a sparsifying transform that succinctly represents structured activation patterns that conform to a hierarchical dependency graph. Second, we establish that the proposed transform facilitates detection of very weak activation patterns that cannot be detected with existing methods. Third, we show that the structure of the hierarchical dependency graph governing the activation process, and hence the network transform, can be learnt from very few (logarithmic in network size) independent snapshots of network activity.

Submitted 28 February, 2010; originally announced March 2010.

arXiv:1001.5311 [pdf, ps, other]

Distilled Sensing: Adaptive Sampling for Sparse Detection and Estimation

Authors: Jarvis Haupt, Rui Castro, Robert Nowak

Abstract: Adaptive sampling results in dramatic improvements in the recovery of sparse signals in white Gaussian noise. A sequential adaptive sampling-and-refinement procedure called Distilled Sensing (DS) is proposed and analyzed. DS is a form of multi-stage experimental design and testing. Because of the adaptive nature of the data collection, DS can detect and localize far weaker signals than possible from non-adaptive measurements. In particular, reliable detection and localization (support estimation) using non-adaptive samples is possible only if the signal amplitudes grow logarithmically with the problem dimension. Here it is shown that using adaptive sampling, reliable detection is possible provided the amplitude exceeds a constant, and localization is possible when the amplitude exceeds any arbitrarily slowly growing function of the dimension.

Submitted 27 May, 2010; v1 submitted 29 January, 2010; originally announced January 2010.

Comments: 23 pages, 2 figures. Revision includes minor clarifications, along with more illustrative experimental results (cf. Figure 2)

Report number: Rice University ECE Technical Report TREE1001

arXiv:0910.4397 [pdf, other]

The Geometry of Generalized Binary Search

Authors: Robert D. Nowak

Abstract: This paper investigates the problem of determining a binary-valued function through a sequence of strategically selected queries. The focus is an algorithm called Generalized Binary Search (GBS). GBS is a well-known greedy algorithm for determining a binary-valued function through a sequence of strategically selected queries. At each step, a query is selected that most evenly splits the hypotheses under consideration into two disjoint subsets, a natural generalization of the idea underlying classic binary search. This paper develops novel incoherence and geometric conditions under which GBS achieves the information-theoretically optimal query complexity; i.e., given a collection of N hypotheses, GBS terminates with the correct function after no more than a constant times log N queries. Furthermore, a noise-tolerant version of GBS is developed that also achieves the optimal query complexity. These results are applied to learning halfspaces, a problem arising routinely in image processing and machine learning.

Submitted 25 June, 2013; v1 submitted 22 October, 2009; originally announced October 2009.

Comments: corrected typo in Thm 3
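
The greedy query rule described in the abstract above can be illustrated with a minimal noiseless sketch. This is not the paper's implementation: the function name `generalized_binary_search`, the table-of-labels representation of the hypotheses, and the threshold-function example are all assumptions made here for illustration, and the noise-tolerant variant is not covered.

```python
from typing import Callable, List, Sequence


def generalized_binary_search(
    hypotheses: Sequence[Sequence[int]],
    oracle: Callable[[int], int],
) -> List[int]:
    """Noiseless GBS sketch (illustrative, not the paper's code).

    hypotheses[i][q] is the +1/-1 label that hypothesis i assigns to query q.
    oracle(q) returns the true +1/-1 label of query q.
    Returns the indices of the hypotheses consistent with every answer.
    """
    num_queries = len(hypotheses[0])
    viable = list(range(len(hypotheses)))
    asked = set()

    while len(viable) > 1:
        candidates = [q for q in range(num_queries) if q not in asked]
        if not candidates:
            break

        # Imbalance of a query: |sum of labels| over the viable hypotheses.
        # Zero means a perfectly even split into two disjoint subsets.
        def imbalance(q: int) -> int:
            return abs(sum(hypotheses[i][q] for i in viable))

        best = min(candidates, key=imbalance)
        if imbalance(best) == len(viable):
            # Every remaining query gets the same label from all viable
            # hypotheses, so no further query can separate them.
            break

        asked.add(best)
        answer = oracle(best)
        viable = [i for i in viable if hypotheses[i][best] == answer]

    return viable


if __name__ == "__main__":
    # Hypothetical example: threshold functions on 8 points.
    # Hypothesis t labels query q as +1 if q >= t, else -1 (t = 0..8).
    hyps = [[1 if q >= t else -1 for q in range(8)] for t in range(9)]
    true_t = 5
    print(generalized_binary_search(hyps, lambda q: 1 if q >= true_t else -1))
```

For threshold functions the rule reduces to classic binary search, which is the special case the abstract says GBS generalizes.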

arXiv:0908.3593 [pdf, ps, other]

doi 10.1214/08-AOS661

Adaptive Hausdorff estimation of density level sets

Authors: Aarti Singh, Clayton Scott, Robert Nowak

Abstract: Consider the problem of estimating the $γ$-level set $G^*_γ=\{x:f(x)\geqγ\}$ of an unknown $d$-dimensional density function $f$ based on $n$ independent observations $X_1,...,X_n$ from the density. This problem has been addressed under global error criteria related to the symmetric set difference. However, in certain applications a spatially uniform mode of convergence is desirable to ensure that the estimated set is close to the target set everywhere. The Hausdorff error criterion provides this degree of uniformity and, hence, is more appropriate in such situations. It is known that the minimax optimal rate of error convergence for the Hausdorff metric is $(n/\log n)^{-1/(d+2α)}$ for level sets with boundaries that have a Lipschitz functional form, where the parameter $α$ characterizes the regularity of the density around the level of interest. However, the estimators proposed in previous work are nonadaptive to the density regularity and require knowledge of the parameter $α$. Furthermore, previously developed estimators achieve the minimax optimal rate for rather restricted classes of sets (e.g., the boundary fragment and star-shaped sets) that effectively reduce the set estimation problem to a function estimation problem. This characterization precludes level sets with multiple connected components, which are fundamental to many applications. This paper presents a fully data-driven procedure that is adaptive to unknown regularity conditions and achieves near minimax optimal Hausdorff error control for a class of density level sets with very general shapes and multiple connected components.

Submitted 25 August, 2009; originally announced August 2009.

Comments: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/08-AOS661

Report number: IMS-AOS-AOS661

MSC Class: 62G05; 62G20 (Primary)

Journal ref: Annals of Statistics 2009, Vol. 37, No. 5B, 2760-2782

arXiv:math/0406424 [pdf, ps, other]

doi 10.1214/009053604000000076

Multiscale likelihood analysis and complexity penalized estimation

Authors: Eric D. Kolaczyk, Robert D. Nowak

Abstract: We describe here a framework for a certain class of multiscale likelihood factorizations wherein, in analogy to a wavelet decomposition of an L^2 function, a given likelihood function has an alternative representation as a product of conditional densities reflecting information in both the data and the parameter vector localized in position and scale. The framework is developed as a set of sufficient conditions for the existence of such factorizations, formulated in analogy to those underlying a standard multiresolution analysis for wavelets, and hence can be viewed as a multiresolution analysis for likelihoods. We then consider the use of these factorizations in the task of nonparametric, complexity penalized likelihood estimation. We study the risk properties of certain thresholding and partitioning estimators, and demonstrate their adaptivity and near-optimality, in a minimax sense over a broad range of function spaces, based on squared Hellinger distance as a loss function. In particular, our results provide an illustration of how properties of classical wavelet-based estimators can be obtained in a single, unified framework that includes models for continuous, count and categorical data types.

Submitted 22 June, 2004; originally announced June 2004.

Report number: IMS-AOS-AOS140

MSC Class: 62C20; 62G05 (Primary) 60E05 (Secondary)

Journal ref: Annals of Statistics 2004, Vol. 32, No. 2, 500-527

Showing 1–28 of 28 results for author: Nowak, R