On the Algorithmic Stability and Generalization of Adaptive Optimization Methods

Authors: Han Nguyen, Hai Pham, Sashank J. Reddi, Barnabás Póczos

Abstract: Despite their popularity in deep learning and machine learning in general, the theoretical properties of adaptive optimizers such as Adagrad, RMSProp, Adam or AdamW are not yet fully understood. In this paper, we develop a novel framework to study the stability and generalization of these optimization methods. Based on this framework, we show provable guarantees about such properties that depend heavily on a single parameter $β_2$. Our empirical experiments support our claims and provide practical insights into the stability and generalization properties of adaptive optimization methods.

Submitted 7 November, 2022; originally announced November 2022.

Comments: 21 pages including appendix

arXiv:2004.08597 [pdf, other]

Robust Density Estimation under Besov IPM Losses

Authors: Ananya Uppal, Shashank Singh, Barnabas Poczos

Abstract: We study minimax convergence rates of nonparametric density estimation in the Huber contamination model, in which a proportion of the data comes from an unknown outlier distribution. We provide the first results for this problem under a large family of losses, called Besov integral probability metrics (IPMs), that includes $\mathcal{L}^p$, Wasserstein, Kolmogorov-Smirnov, and other common distances between probability distributions. Specifically, under a range of smoothness assumptions on the population and outlier distributions, we show that a re-scaled thresholding wavelet series estimator achieves minimax optimal convergence rates under a wide variety of losses. Finally, based on connections that have recently been shown between nonparametric density estimation under IPM losses and generative adversarial networks (GANs), we show that certain GAN architectures also achieve these minimax rates.

Submitted 6 September, 2021; v1 submitted 18 April, 2020; originally announced April 2020.

arXiv:2002.08528 [pdf, other]

Adaptive Sampling Distributed Stochastic Variance Reduced Gradient for Heterogeneous Distributed Datasets

Authors: Ilqar Ramazanli, Han Nguyen, Hai Pham, Sashank J. Reddi, Barnabas Poczos

Abstract: We study distributed optimization algorithms for minimizing the average of heterogeneous functions distributed across several machines, with a focus on communication efficiency. In such settings, naively using classical stochastic gradient descent (SGD) or its variants (e.g., SVRG) with a uniform sampling of machines typically yields poor performance. It often makes the convergence rate depend on the maximum Lipschitz constant of the gradients across the devices. In this paper, we propose a novel adaptive sampling of machines specially catered to these settings. Our method relies on an adaptive estimate of local Lipschitz constants based on the information of past gradients. We show that the new method improves the dependence of the convergence rate from the maximum Lipschitz constant to the average Lipschitz constant across machines, thereby significantly accelerating convergence. Our experiments demonstrate that our method indeed speeds up the convergence of the standard SVRG algorithm in heterogeneous environments.

Submitted 17 November, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

arXiv:1902.03511 [pdf, other]

Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses

Authors: Ananya Uppal, Shashank Singh, Barnabás Póczos

Abstract: We study the problem of estimating a nonparametric probability density under a large family of losses called Besov IPMs, which include, for example, $\mathcal{L}^p$ distances, total variation distance, and generalizations of both Wasserstein and Kolmogorov-Smirnov distances. For a wide variety of settings, we provide both lower and upper bounds, identifying precisely how the choice of loss function and assumptions on the data interact to determine the minimax optimal convergence rate. We also show that linear distribution estimates, such as the empirical distribution or kernel density estimator, often fail to converge at the optimal rate. Our bounds generalize, unify, or improve several recent and classical results. Moreover, IPMs can be used to formalize a statistical model of generative adversarial networks (GANs). Thus, we show how our results imply bounds on the statistical error of a GAN, showing, for example, that GANs can strictly outperform the best linear estimator.

Submitted 13 January, 2020; v1 submitted 9 February, 2019; originally announced February 2019.

Comments: Advances in Neural Information Processing Systems. 2019

arXiv:1810.02054 [pdf, other]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Authors: Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh

Abstract: One of the mysteries in the success of neural networks is that randomly initialized first order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU activated neural networks. For an $m$ hidden node shallow neural network with ReLU activation and $n$ training data, we show that as long as $m$ is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis relies on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum. We believe these insights are also useful in analyzing deep models and other first order methods.

Submitted 4 February, 2019; v1 submitted 4 October, 2018; originally announced October 2018.

Comments: ICLR 2019

arXiv:1805.08836 [pdf, other]

Nonparametric Density Estimation under Adversarial Losses

Authors: Shashank Singh, Ananya Uppal, Boyue Li, Chun-Liang Li, Manzil Zaheer, Barnabás Póczos

Abstract: We study minimax convergence rates of nonparametric density estimation under a large class of loss functions called "adversarial losses", which, besides classical $\mathcal{L}^p$ losses, includes maximum mean discrepancy (MMD), Wasserstein distance, and total variation distance. These losses are closely related to the losses encoded by discriminator networks in generative adversarial networks (GANs). In a general framework, we study how the choice of loss and the assumed smoothness of the underlying density together determine the minimax rate. We also discuss implications for training GANs based on deep ReLU networks, and more general connections to learning implicit generative models in a minimax statistical sense.

Submitted 28 October, 2018; v1 submitted 22 May, 2018; originally announced May 2018.

arXiv:1803.11451 [pdf, ps, other]

Minimax Estimation of Quadratic Fourier Functionals

Authors: Shashank Singh, Bharath K. Sriperumbudur, Barnabás Póczos

Abstract: We study estimation of (semi-)inner products between two nonparametric probability distributions, given IID samples from each distribution. These products include relatively well-studied classical $\mathcal{L}^2$ and Sobolev inner products, as well as those induced by translation-invariant reproducing kernels, for which we believe our results are the first. We first propose estimators for these quantities, and the induced (semi)norms and (pseudo)metrics. We then prove non-asymptotic upper bounds on their mean squared error, in terms of weights both of the inner product and of the two distributions, in the Fourier basis. Finally, we prove minimax lower bounds that imply rate-optimality of the proposed estimators over Fourier ellipsoids.

Submitted 1 September, 2018; v1 submitted 30 March, 2018; originally announced March 2018.

arXiv:1802.08855 [pdf, ps, other]

Minimax Distribution Estimation in Wasserstein Distance

Authors: Shashank Singh, Barnabás Póczos

Abstract: The Wasserstein metric is an important measure of distance between probability distributions, with applications in machine learning, statistics, probability theory, and data analysis. This paper provides upper and lower bounds on statistical minimax rates for the problem of estimating a probability distribution under Wasserstein loss, using only metric properties, such as covering and packing numbers, of the sample space, and weak moment assumptions on the probability distributions.

Submitted 6 November, 2019; v1 submitted 24 February, 2018; originally announced February 2018.

arXiv:1712.00779 [pdf, other]

Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima

Authors: Simon S. Du, Jason D. Lee, Yuandong Tian, Barnabas Poczos, Aarti Singh

Abstract: We consider the problem of learning a one-hidden-layer neural network with non-overlapping convolutional layer and ReLU activation, i.e., $f(\mathbf{Z}, \mathbf{w}, \mathbf{a}) = \sum_j a_jσ(\mathbf{w}^T\mathbf{Z}_j)$, in which both the convolutional weights $\mathbf{w}$ and the output weights $\mathbf{a}$ are parameters to be learned. When the labels are the outputs from a teacher network of the same architecture with fixed weights $(\mathbf{w}^*, \mathbf{a}^*)$, we prove that with Gaussian input $\mathbf{Z}$, there is a spurious local minimizer. Surprisingly, in the presence of the spurious local minimizer, gradient descent with weight normalization from randomly initialized weights can still be proven to recover the true parameters with constant probability, which can be boosted to probability $1$ with multiple restarts. We also show that with constant probability, the same procedure could also converge to the spurious local minimum, showing that the local minimum plays a non-trivial role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the gradient descent dynamics has two phases: it starts off slow, but converges much faster after several iterations.

Submitted 14 June, 2018; v1 submitted 3 December, 2017; originally announced December 2017.

Comments: Accepted by ICML 2018

arXiv:1708.08587 [pdf, other]

On the Reconstruction Risk of Convolutional Sparse Dictionary Learning

Authors: Shashank Singh, Barnabás Póczos, Jian Ma

Abstract: Sparse dictionary learning (SDL) has become a popular method for adaptively identifying parsimonious representations of a dataset, a fundamental problem in machine learning and signal processing. While most work on SDL assumes a training dataset of independent and identically distributed samples, a variant known as convolutional sparse dictionary learning (CSDL) relaxes this assumption, allowing more general sequential data sources, such as time series or other dependent data. Although recent work has explored the statistical properties of classical SDL, the statistical properties of CSDL remain unstudied. This paper begins to study this by identifying the minimax convergence rate of CSDL in terms of reconstruction risk, by both upper bounding the risk of an established CSDL estimator and proving a matching information-theoretic lower bound. Our results indicate that consistency in reconstruction risk is possible precisely in the 'ultra-sparse' setting, in which the sparsity (i.e., the number of feature occurrences) is in $o(N)$ in terms of the length $N$ of the training sequence. Notably, our results make very weak assumptions, allowing arbitrary dictionaries and dependent measurement noise. Finally, we verify our theoretical results with numerical experiments on synthetic data.

Submitted 24 February, 2018; v1 submitted 29 August, 2017; originally announced August 2017.

arXiv:1705.10412 [pdf, other]

Gradient Descent Can Take Exponential Time to Escape Saddle Points

Authors: Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, Aarti Singh

Abstract: Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015, Jin et al., 2017] is not slowed down by saddle points - it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate our theoretical findings.

Submitted 5 November, 2017; v1 submitted 29 May, 2017; originally announced May 2017.

Comments: Accepted by NIPS 2017

arXiv:1702.07803 [pdf, ps, other]

Nonparanormal Information Estimation

Authors: Shashank Singh, Barnabás Póczos

Abstract: We study the problem of using i.i.d. samples from an unknown multivariate probability distribution $p$ to estimate the mutual information of $p$. This problem has recently received attention in two settings: (1) where $p$ is assumed to be Gaussian and (2) where $p$ is assumed only to lie in a large nonparametric smoothness class. Estimators proposed for the Gaussian case converge in high dimensions when the Gaussian assumption holds, but are brittle, failing dramatically when $p$ is not Gaussian. Estimators proposed for the nonparametric case fail to converge with realistic sample sizes except in very low dimensions. As a result, there is a lack of robust mutual information estimators for many realistic data. To address this, we propose estimators for mutual information when $p$ is assumed to be a nonparanormal (a.k.a., Gaussian copula) model, a semiparametric compromise between Gaussian and nonparametric extremes. Using theoretical bounds and experiments, we show these estimators strike a practical balance between robustness and scaling with dimensionality.

Submitted 24 February, 2017; originally announced February 2017.

arXiv:1608.06879 [pdf, other]

AIDE: Fast and Communication Efficient Distributed Optimization

Authors: Sashank J. Reddi, Jakub Konečný, Peter Richtárik, Barnabás Póczos, Alex Smola

Abstract: In this paper, we present two new communication-efficient methods for distributed minimization of an average of functions. The first algorithm is an inexact variant of the DANE algorithm that allows any local algorithm to return an approximate solution to a local subproblem. We show that such a strategy does not affect the theoretical guarantees of DANE significantly. In fact, our approach can be viewed as a robustification strategy since the method is substantially better behaved than DANE on data partitions arising in practice. It is well known that the DANE algorithm does not match the communication complexity lower bounds. To bridge this gap, we propose an accelerated variant of the first method, called AIDE, that not only matches the communication lower bounds but can also be implemented using a purely first-order oracle. Our empirical results show that AIDE is superior to other communication-efficient algorithms in settings that naturally arise in machine learning applications.

Submitted 24 August, 2016; originally announced August 2016.

arXiv:1607.08254 [pdf, other]

Stochastic Frank-Wolfe Methods for Nonconvex Optimization

Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We study Frank-Wolfe methods for nonconvex stochastic and finite-sum optimization problems. Frank-Wolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projection-free property and their ability to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limited. In this paper, we propose nonconvex stochastic Frank-Wolfe methods and analyze their convergence properties. For objective functions that decompose into a finite-sum, we leverage ideas from variance reduction techniques for convex optimization to obtain new variance reduced nonconvex Frank-Wolfe methods that have provably faster convergence than the classical Frank-Wolfe method. Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting.

Submitted 29 July, 2016; v1 submitted 27 July, 2016; originally announced July 2016.

arXiv:1606.01554 [pdf, ps, other]

Finite-Sample Analysis of Fixed-k Nearest Neighbor Density Functional Estimators

Authors: Shashank Singh, Barnabás Póczos

Abstract: We provide finite-sample analysis of a general framework for using k-nearest neighbor statistics to estimate functionals of a nonparametric continuous probability density, including entropies and divergences. Rather than plugging a consistent density estimate (which requires $k \to \infty$ as the sample size $n \to \infty$) into the functional of interest, the estimators we consider fix k and perform a bias correction. This is more efficient computationally, and, as we show in certain cases, statistically, leading to faster convergence rates. Our framework unifies several previous estimators, for most of which ours are the first finite sample guarantees.

Submitted 5 June, 2016; originally announced June 2016.

Comments: 16 pages, 0 figures

arXiv:1605.06900 [pdf, other]

Fast Stochastic Methods for Nonsmooth Nonconvex Optimization

Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We analyze stochastic algorithms for optimizing nonconvex, nonsmooth finite-sum problems, where the nonconvex part is smooth and the nonsmooth part is convex. Surprisingly, unlike the smooth case, our knowledge of this fundamental problem is very limited. For example, it is not known whether the proximal stochastic gradient method with constant minibatch converges to a stationary point. To tackle this issue, we develop fast stochastic algorithms that provably converge to a stationary point for constant minibatches. Furthermore, using a variant of these algorithms, we show provably faster convergence than batch proximal gradient descent. Finally, we prove global linear convergence rate for an interesting subclass of nonsmooth nonconvex functions, that subsumes several recent works. This paper builds upon our recent series of papers on fast stochastic methods for smooth nonconvex optimization [22, 23], with a novel analysis for nonconvex and nonsmooth functions.

Submitted 23 May, 2016; originally announced May 2016.

arXiv:1605.05785 [pdf, other]

Efficient Nonparametric Smoothness Estimation

Authors: Shashank Singh, Simon S. Du, Barnabás Póczos

Abstract: Sobolev quantities (norms, inner products, and distances) of probability density functions are important in the theory of nonparametric statistics, but have rarely been used in practice, partly due to a lack of practical estimators. They also include, as special cases, $L^2$ quantities which are used in many applications. We propose and analyze a family of estimators for Sobolev quantities of unknown probability density functions. We bound the bias and variance of our estimators over finite samples, finding that they are generally minimax rate-optimal. Our estimators are significantly more computationally tractable than previous estimators, and exhibit a statistical/computational trade-off allowing them to adapt to computational constraints. We also draw theoretical connections to recent work on fast two-sample testing. Finally, we empirically validate our estimators on synthetic data.

Submitted 21 July, 2016; v1 submitted 18 May, 2016; originally announced May 2016.

arXiv:1603.08589 [pdf, other]

Generalized Exponential Concentration Inequality for Rényi Divergence Estimation

Authors: Shashank Singh, Barnabás Póczos

Abstract: Estimating divergences in a consistent way is of great importance in many machine learning tasks. Although this is a fundamental problem in nonparametric statistics, to the best of our knowledge there has been no finite sample exponential inequality convergence bound derived for any divergence estimators. The main contribution of our work is to provide such a bound for an estimator of Rényi-$α$ divergence for a smooth Hölder class of densities on the $d$-dimensional unit cube $[0, 1]^d$. We also illustrate our theoretical results with a numerical experiment.

Submitted 28 March, 2016; originally announced March 2016.

Comments: In 31st International Conference on Machine Learning (ICML), 2014

arXiv:1603.08578 [pdf, ps, other]

Analysis of k-Nearest Neighbor Distances with Application to Entropy Estimation

Authors: Shashank Singh, Barnabás Póczos

Abstract: Estimating entropy and mutual information consistently is important for many machine learning applications. The Kozachenko-Leonenko (KL) estimator (Kozachenko & Leonenko, 1987) is a widely used nonparametric estimator for the entropy of multivariate continuous random variables, as well as the basis of the mutual information estimator of Kraskov et al. (2004), perhaps the most widely used estimator of mutual information in this setting. Despite the practical importance of these estimators, major theoretical questions regarding their finite-sample behavior remain open. This paper proves finite-sample bounds on the bias and variance of the KL estimator, showing that it achieves the minimax convergence rate for certain classes of smooth functions. In proving these bounds, we analyze finite-sample behavior of k-nearest neighbors (k-NN) distance statistics (on which the KL estimator is based). We derive concentration inequalities for k-NN distances and a general expectation bound for statistics of k-NN distances, which may be useful for other analyses of k-NN methods.

Submitted 21 July, 2016; v1 submitted 28 March, 2016; originally announced March 2016.
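
For readers unfamiliar with the Kozachenko-Leonenko estimator named in the abstract above, the following is a minimal sketch of one common textbook form of it, restricted to one-dimensional data so that it needs only the Python standard library. It is not taken from the paper: the function names `digamma_int` and `kl_entropy_1d` are hypothetical, the digamma function is evaluated only at positive integers, and the samples are assumed to be distinct (a repeated value would give a zero distance and a failing logarithm).

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def kl_entropy_1d(samples, k=1):
    """Fixed-k nearest-neighbor entropy estimate (in nats) for 1-D samples.

    Uses the form  psi(n) - psi(k) + log(c_d) + (d/n) * sum_i log(r_i)
    with d = 1, c_1 = 2 (length of the unit ball [-1, 1]) and r_i the
    distance from sample i to its k-th nearest neighbor.
    Assumes len(samples) > k and that all samples are distinct.
    """
    xs = sorted(samples)
    n = len(xs)
    log_sum = 0.0
    for i, x in enumerate(xs):
        # In sorted 1-D data the k nearest neighbors of xs[i] lie among
        # the k points on either side of it.
        lo, hi = max(0, i - k), min(n, i + k + 1)
        dists = sorted(abs(x - xs[j]) for j in range(lo, hi) if j != i)
        log_sum += math.log(dists[k - 1])
    return digamma_int(n) - digamma_int(k) + math.log(2.0) + log_sum / n


if __name__ == "__main__":
    random.seed(0)
    uniform_samples = [random.random() for _ in range(2000)]
    # The true differential entropy of Uniform(0, 1) is 0 nats.
    print(kl_entropy_1d(uniform_samples, k=3))
```

Keeping k fixed, rather than letting it grow with the sample size, is what the digamma correction terms compensate for; the sorted-window search is only a convenience of the one-dimensional setting.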

arXiv:1603.06160 [pdf, other]

Stochastic Variance Reduction for Nonconvex Optimization

Authors: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (SVRG) methods for them. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (SGD); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary points) of SVRG for nonconvex optimization, and show that it is provably faster than SGD and gradient descent. We also analyze a subclass of nonconvex problems on which SVRG attains linear convergence to the global optimum. We extend our analysis to mini-batch variants of SVRG, showing (theoretical) linear speedup due to mini-batching in parallel settings.

Submitted 4 April, 2016; v1 submitted 19 March, 2016; originally announced March 2016.

Comments: Minor feedback changes
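
As a rough illustration of the SVRG update that the abstract above refers to, here is a minimal standard-library sketch of the basic loop: a full gradient is computed at a snapshot point, and each inner step corrects a single-component gradient with it. This is not the paper's algorithm or its step-size schedule. The names `svrg` and `make_least_squares_grad`, the step size, and the epoch lengths are all assumptions chosen for illustration, and the toy objective is a convex least-squares problem (picked only because its minimizer is known), whereas the paper analyzes the nonconvex case.

```python
import random


def make_least_squares_grad(data, targets):
    """Gradient of f_i(w) = 0.5 * (a_i . w - b_i)^2 for a toy problem."""
    def grad_i(w, i):
        residual = sum(aj * wj for aj, wj in zip(data[i], w)) - targets[i]
        return [residual * aj for aj in data[i]]
    return grad_i


def svrg(grad_i, n, w0, step, epochs, inner_steps, rng):
    """Basic SVRG loop for minimizing (1/n) * sum_i f_i(w)."""
    w_snap = list(w0)
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full = [0.0] * len(w_snap)
        for i in range(n):
            g = grad_i(w_snap, i)
            full = [f + gj / n for f, gj in zip(full, g)]
        w = list(w_snap)
        for _ in range(inner_steps):
            i = rng.randrange(n)
            g_w = grad_i(w, i)
            g_s = grad_i(w_snap, i)
            # Variance-reduced direction: grad_i(w) - grad_i(snapshot) + full.
            w = [wj - step * (a - b + c)
                 for wj, a, b, c in zip(w, g_w, g_s, full)]
        # Use the last inner iterate as the next snapshot.
        w_snap = w
    return w_snap


if __name__ == "__main__":
    rng = random.Random(0)
    w_true = [1.5, -0.5]
    data = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(50)]
    targets = [a[0] * w_true[0] + a[1] * w_true[1] for a in data]
    grad_i = make_least_squares_grad(data, targets)
    print(svrg(grad_i, len(data), [0.0, 0.0], 0.1, 30, 100, rng))
```

Taking the last inner iterate as the next snapshot is one of several common choices; analyses sometimes use a randomly selected or averaged inner iterate instead.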

arXiv:1603.06159 [pdf, other]

Fast Incremental Method for Nonconvex Optimization

Authors: Sashank J. Reddi, Suvrit Sra, Barnabas Poczos, Alex Smola

Abstract: We analyze a fast incremental aggregated gradient method for optimizing nonconvex problems of the form $\min_x \sum_i f_i(x)$. Specifically, we analyze the SAGA algorithm within an Incremental First-order Oracle framework, and show that it converges to a stationary point provably faster than both gradient descent and stochastic gradient descent. We also discuss Polyak's special class of nonconvex problems, for which SAGA converges at a linear rate to the global optimum. Finally, we analyze the practically valuable regularized and minibatch variants of SAGA. To our knowledge, this paper presents the first analysis of fast convergence for an incremental aggregated gradient method for nonconvex problems.

Submitted 19 March, 2016; originally announced March 2016.

arXiv:1509.08535 [pdf, other]

Boolean Matrix Factorization and Noisy Completion via Message Passing

Authors: Siamak Ravanbakhsh, Barnabas Poczos, Russell Greiner

Abstract: Boolean matrix factorization and Boolean matrix completion from noisy observations are desirable unsupervised data-analysis methods due to their interpretability, but hard to perform due to their NP-hardness. We treat these problems as maximum a posteriori inference problems in a graphical model and present a message passing approach that scales linearly with the number of observations and factors. Our empirical study demonstrates that message passing is able to recover low-rank Boolean matrices within the boundaries of theoretically possible recovery, and that it compares favorably with the state of the art in real-world applications, such as collaborative filtering with large-scale Boolean data.

Submitted 4 February, 2016; v1 submitted 28 September, 2015; originally announced September 2015.

arXiv:1508.00655 [pdf, other]

Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing

Authors: Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman

Abstract: Nonparametric two sample testing is a decision theoretic problem that involves identifying differences between two random variables without making parametric assumptions about their underlying distributions. We refer to the most common settings as mean difference alternatives (MDA), for testing differences only in first moments, and general difference alternatives (GDA), which is about testing for any difference in distributions. A large number of test statistics have been proposed for both these settings. This paper connects three classes of statistics - high dimensional variants of Hotelling's t-test, statistics based on Reproducing Kernel Hilbert Spaces, and energy statistics based on pairwise distances. We ask the question: how much statistical power do popular kernel and distance based tests for GDA have when the unknown distributions differ in their means, compared to specialized tests for MDA? We formally characterize the power of popular tests for GDA like the Maximum Mean Discrepancy with the Gaussian kernel (gMMD) and bandwidth-dependent variants of the Energy Distance with the Euclidean norm (eED) in the high-dimensional MDA regime. Some practically important properties include: (a) eED and gMMD have asymptotically equal power; furthermore they enjoy a free lunch because, while they are additionally consistent for GDA, they also have the same power as specialized high-dimensional t-test variants for MDA. All these tests are asymptotically optimal (including matching constants) under MDA for spherical covariances, according to simple lower bounds; (b) the power of gMMD is independent of the kernel bandwidth, as long as it is larger than the choice made by the median heuristic; (c) there is a clear and smooth computation-statistics tradeoff for linear-time, subquadratic-time and quadratic-time versions of these tests, with more computation resulting in higher power.

Submitted 4 August, 2015; originally announced August 2015.

Comments: 35 pages, 4 figures

arXiv:1505.04215 [pdf, other]

An Analysis of Active Learning With Uniform Feature Noise

Authors: Aaditya Ramdas, Barnabas Poczos, Aarti Singh, Larry Wasserman

Abstract: In active learning, the user sequentially chooses values for feature $X$ and an oracle returns the corresponding label $Y$. In this paper, we consider the effect of feature noise in active learning, which could arise either because $X$ itself is being measured, or it is corrupted in transmission to the oracle, or the oracle returns the label of a noisy version of the query point. In statistics, feature noise is known as "errors in variables" and has been studied extensively in non-active settings. However, the effect of feature noise in active learning has not been studied before. We consider the well-known Berkson errors-in-variables model with additive uniform noise of width $σ$. Our simple but revealing setting is that of one-dimensional binary classification, where the goal is to learn a threshold (the point where the probability of a $+$ label crosses half). We deal with regression functions that are antisymmetric in a region of size $σ$ around the threshold and also satisfy Tsybakov's margin condition around the threshold. We prove minimax lower and upper bounds which demonstrate that when $σ$ is smaller than the minimax active/passive noiseless error derived in \cite{CN07}, then noise has no effect on the rates and one achieves the same noiseless rates. For larger $σ$, the \textit{unflattening} of the regression function on convolution with uniform noise, along with its local antisymmetry around the threshold, together yield a behaviour where noise \textit{appears} to be beneficial. Our key result is that active learning can buy significant improvement over a passive strategy even in the presence of feature noise.

Submitted 15 May, 2015; originally announced May 2015.

Comments: 24 pages, 2 figures, published in the proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014

arXiv:1411.6314 [pdf, other]

On the High-dimensional Power of Linear-time Kernel Two-Sample Testing under Mean-difference Alternatives

Authors: Aaditya Ramdas, Sashank J. Reddi, Barnabas Poczos, Aarti Singh, Larry Wasserman

Abstract: Nonparametric two sample testing deals with the question of consistently deciding if two distributions are different, given samples from both, without making any parametric assumptions about the form of the distributions. The current literature is split into two kinds of tests - those which are consistent without any assumptions about how the distributions may differ (\textit{general} alternatives), and those which are designed to specifically test easier alternatives, like a difference in means (\textit{mean-shift} alternatives). The main contribution of this paper is to explicitly characterize the power of a popular nonparametric two sample test, designed for general alternatives, under a mean-shift alternative in the high-dimensional setting. Specifically, we explicitly derive the power of the linear-time Maximum Mean Discrepancy statistic using the Gaussian kernel, where the dimension and sample size can both tend to infinity at any rate, and the two distributions differ in their means. As a corollary, we find that if the signal-to-noise ratio is held constant, then the test's power goes to one if the number of samples increases faster than the dimension increases. This is the first explicit power derivation for a general nonparametric test in the high-dimensional setting, and also the first analysis of how tests designed for general alternatives perform when faced with easier ones.

Submitted 23 November, 2014; originally announced November 2014.

Comments: 25 pages, 5 figures
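
A linear-time MMD statistic with a Gaussian kernel, as named in the abstract above, can be sketched as follows. This is an illustrative version under stated assumptions rather than the paper's exact definition: it averages the usual four-kernel term over disjoint consecutive pairs of samples, the bandwidth enters as exp(-||u - v||^2 / (2 * bandwidth^2)), and the function names are hypothetical.

```python
import math


def gaussian_kernel(u, v, bandwidth):
    """Gaussian kernel; the 2*bandwidth^2 scaling is an assumed convention."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def linear_time_mmd2(xs, ys, bandwidth=1.0):
    """Estimate squared MMD in one pass over disjoint pairs of samples.

    Each pair (x1, x2), (y1, y2) contributes
    k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1),
    so the cost is linear in the sample size.
    """
    n_pairs = min(len(xs), len(ys)) // 2
    if n_pairs == 0:
        raise ValueError("need at least two samples from each distribution")
    total = 0.0
    for i in range(n_pairs):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        total += (
            gaussian_kernel(x1, x2, bandwidth)
            + gaussian_kernel(y1, y2, bandwidth)
            - gaussian_kernel(x1, y2, bandwidth)
            - gaussian_kernel(x2, y1, bandwidth)
        )
    return total / n_pairs
```

Because each sample is used in only one pair, the estimate is cheaper but noisier than the quadratic-time statistic; a test would compare it against a threshold, which this sketch does not compute.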

arXiv:1411.2066 [pdf, ps, other]

Learning Theory for Distribution Regression

Authors: Zoltan Szabo, Bharath Sriperumbudur, Barnabas Poczos, Arthur Gretton

Abstract: We focus on the distribution regression problem: regressing to vector-valued outputs from probability measures. Many important machine learning and statistical tasks fit into this framework, including multi-instance learning and point estimation problems without analytical solution (such as hyperparameter or entropy estimation). Despite the large number of available heuristics in the literature, the inherent two-stage sampled nature of the problem makes the theoretical analysis quite challenging, since in practice only samples from sampled distributions are observable, and the estimates have to rely on similarities computed between sets of points. To the best of our knowledge, the only existing technique with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which often performs poorly in practice), and the domain of the distributions to be compact Euclidean. In this paper, we study a simple, analytically computable, ridge regression-based alternative to distribution regression, where we embed the distributions to a reproducing kernel Hilbert space, and learn the regressor from the embeddings to the outputs. Our main contribution is to prove that this scheme is consistent in the two-stage sampled setup under mild conditions (on separable topological domains enriched with kernels): we present an exact computational-statistical efficiency trade-off analysis showing that our estimator is able to match the one-stage sampled minimax optimal rate [Caponnetto and De Vito, 2007; Steinwart et al., 2009]. This result answers a 17-year-old open question, establishing the consistency of the classical set kernel [Haussler, 1999; Gaertner et al., 2002] in regression. We also cover consistency for more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010].

Submitted 21 October, 2016; v1 submitted 7 November, 2014; originally announced November 2014.

Comments: Final version appeared at JMLR, with supplement. Code: https://bitbucket.org/szzoli/ite/. arXiv admin note: text overlap with arXiv:1402.1754

MSC Class: 62G08; 46E22; 47B32 ACM Class: G.3; I.2.6

Journal ref: Journal of Machine Learning Research, 17(152):1-40, 2016

arXiv:1406.2083 [pdf, other]

On the Decreasing Power of Kernel and Distance based Nonparametric Hypothesis Tests in High Dimensions

Authors: Sashank J. Reddi, Aaditya Ramdas, Barnabás Póczos, Aarti Singh, Larry Wasserman

Abstract: This paper is about two related decision theoretic problems, nonparametric two-sample testing and independence testing. There is a belief that two recently proposed solutions, based on kernels and distances between pairs of points, behave well in high-dimensional settings. We identify different sources of misconception that give rise to the above belief. Specifically, we differentiate the hardness of estimation of test statistics from the hardness of testing whether these statistics are zero or not, and explicitly discuss a notion of "fair" alternative hypotheses for these problems as dimension increases. We then demonstrate that the power of these tests actually drops polynomially with increasing dimension against fair alternatives. We end with some theoretical insights and shed light on the \textit{median heuristic} for kernel bandwidth selection. Our work advances the current understanding of the power of modern nonparametric hypothesis tests in high dimensions.

Submitted 23 November, 2014; v1 submitted 9 June, 2014; originally announced June 2014.

Comments: 19 pages, 9 figures, published in AAAI-15: The 29th AAAI Conference on Artificial Intelligence (with author order reversed from ArXiv)

arXiv:1402.2966 [pdf, other]

Nonparametric Estimation of Renyi Divergence and Friends

Authors: Akshay Krishnamurthy, Kirthevasan Kandasamy, Barnabas Poczos, Larry Wasserman

Abstract: We consider nonparametric estimation of $L_2$, Renyi-$α$ and Tsallis-$α$ divergences between continuous distributions. Our approach is to construct estimators for particular integral functionals of two densities and translate them into divergence estimators. For the integral functionals, our estimators are based on corrections of a preliminary plug-in estimator. We show that these estimators achieve the parametric convergence rate of $n^{-1/2}$ when the densities' smoothness, $s$, are both at least $d/4$ where $d$ is the dimension. We also derive minimax lower bounds for this problem which confirm that $s > d/4$ is necessary to achieve the $n^{-1/2}$ rate of convergence. We validate our theoretical guarantees with a number of simulations.

Submitted 12 May, 2014; v1 submitted 12 February, 2014; originally announced February 2014.

arXiv:1402.1754 [pdf, ps, other]

Two-stage Sampled Learning Theory on Distributions

Authors: Zoltan Szabo, Arthur Gretton, Barnabas Poczos, Bharath Sriperumbudur

Abstract: We focus on the distribution regression problem: regressing to a real-valued response from a probability distribution. Although there exist a large number of similarity measures between distributions, very little is known about their generalization performance in specific learning tasks. Learning problems formulated on distributions have an inherent two-stage sampled difficulty: in practice only samples from sampled distributions are observable, and one has to build an estimate on similarities computed between sets of points. To the best of our knowledge, the only existing method with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which suffers from slow convergence issues in high dimensions), and the domain of the distributions to be compact Euclidean. In this paper, we provide theoretical guarantees for a remarkably simple algorithmic alternative to solve the distribution regression problem: embed the distributions to a reproducing kernel Hilbert space, and learn a ridge regressor from the embeddings to the outputs. Our main contribution is to prove the consistency of this technique in the two-stage sampled setting under mild conditions (on separable, topological domains endowed with kernels). For a given total number of observations, we derive convergence rates as an explicit function of the problem difficulty. As a special case, we answer a 15-year-old open question: we establish the consistency of the classical set kernel [Haussler, 1999; Gartner et al., 2002] in regression, and cover more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010].

Submitted 26 January, 2015; v1 submitted 7 February, 2014; originally announced February 2014.

Comments: v6: accepted at AISTATS-2015 for oral presentation; final version; code: https://bitbucket.org/szzoli/ite/; extension to the misspecified and vector-valued case: https://arxiv.boxedpaper.com/abs/1411.2066

MSC Class: 62G08; 46E22; 47B32 ACM Class: G.3; I.2.6
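
The two-step recipe in the abstract above (embed each sampled distribution into an RKHS, then fit a ridge regressor on the embeddings) can be sketched in a few lines. This is a hedged illustration, not the authors' ITE toolbox code: the Gaussian base kernel, the linear (set) kernel on empirical mean embeddings, the `n * lam` scaling of the ridge term, and all function names are assumptions.

```python
import math


def gaussian_kernel(u, v, bandwidth=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def set_kernel(bag_p, bag_q, bandwidth=1.0):
    """Inner product of two empirical mean embeddings (average over all pairs)."""
    total = sum(gaussian_kernel(x, y, bandwidth) for x in bag_p for y in bag_q)
    return total / (len(bag_p) * len(bag_q))


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting on a small dense system."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * sol[j] for j in range(i + 1, n))
        sol[i] = (aug[i][n] - tail) / aug[i][i]
    return sol


def fit(bags, responses, lam, bandwidth=1.0):
    """Kernel ridge coefficients: solve (K + n * lam * I) alpha = y."""
    n = len(bags)
    gram = [[set_kernel(bags[i], bags[j], bandwidth) for j in range(n)]
            for i in range(n)]
    for i in range(n):
        gram[i][i] += n * lam
    return solve_linear(gram, responses)


def predict(bags, alphas, new_bag, bandwidth=1.0):
    """Prediction for a new bag of samples from an unseen distribution."""
    return sum(a * set_kernel(b, new_bag, bandwidth)
               for a, b in zip(alphas, bags))
```

Each "bag" here is a list of points sampled from one distribution, which mirrors the two-stage sampling the abstract describes: the regressor never sees the distributions themselves, only similarities between sets of samples.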

arXiv:1311.2236 [pdf, other]

Fast Distribution To Real Regression

Authors: Junier B. Oliva, Willie Neiswanger, Barnabas Poczos, Jeff Schneider, Eric Xing

Abstract: We study the problem of distribution to real-value regression, where one aims to regress a mapping $f$ that takes in a distribution input covariate $P\in \mathcal{I}$ (for a non-parametric family of distributions $\mathcal{I}$) and outputs a real-valued response $Y=f(P) + ε$. This setting was recently studied, and a "Kernel-Kernel" estimator was introduced and shown to have a polynomial rate of convergence. However, evaluating a new prediction with the Kernel-Kernel estimator scales as $Ω(N)$. This causes the difficult situation where a large amount of data may be necessary for a low estimation risk, but the computation cost of estimation becomes infeasible when the data-set is too large. To this end, we propose the Double-Basis estimator, which looks to alleviate this big data problem in two ways: first, the Double-Basis estimator is shown to have a computation complexity that is independent of the number of instances $N$ when evaluating new predictions after training; secondly, the Double-Basis estimator is shown to have a fast rate of convergence for a general class of mappings $f\in\mathcal{F}$.

Submitted 8 March, 2014; v1 submitted 9 November, 2013; originally announced November 2013.

arXiv:1311.2234 [pdf, other]

FuSSO: Functional Shrinkage and Selection Operator

Authors: Junier B. Oliva, Barnabas Poczos, Timothy Verstynen, Aarti Singh, Jeff Schneider, Fang-Cheng Yeh, Wen-Yih Tseng

Abstract: We present the FuSSO, a functional analogue to the LASSO, that efficiently finds a sparse set of functional input covariates to regress a real-valued response against. The FuSSO does so in a semi-parametric fashion, making no parametric assumptions about the nature of input functional covariates and assuming a linear form to the mapping of functional covariates to the response. We provide a statistical backing for use of the FuSSO via proof of asymptotic sparsistency under various conditions. Furthermore, we observe good results on both synthetic and real-world data.

Submitted 8 March, 2014; v1 submitted 9 November, 2013; originally announced November 2013.

arXiv:1302.0082 [pdf, other]

Distribution-Free Distribution Regression

Authors: Barnabas Poczos, Alessandro Rinaldo, Aarti Singh, Larry Wasserman

Abstract: `Distribution regression' refers to the situation where a response Y depends on a covariate P where P is a probability distribution. The model is Y=f(P) + mu where f is an unknown regression function and mu is a random error. Typically, we do not observe P directly, but rather, we observe a sample from P. In this paper we develop theory and methods for distribution-free versions of distribution regression. This means that we do not make distributional assumptions about the error term mu and covariate P. We prove that when the effective dimension is small enough (as measured by the doubling dimension), then the excess prediction risk converges to zero with a polynomial rate.

Submitted 1 February, 2013; originally announced February 2013.

arXiv:1206.4682 [pdf]

Copula-based Kernel Dependency Measures

Authors: Barnabas Poczos, Zoubin Ghahramani, Jeff Schneider

Abstract: The paper presents a new copula based method for measuring dependence between random variables. Our approach extends the Maximum Mean Discrepancy to the copula of the joint distribution. We prove that this approach has several advantageous properties. Similarly to Shannon mutual information, the proposed dependence measure is invariant to any strictly increasing transformation of the marginal variables. This is important in many applications, for example in feature selection. The estimator is consistent, robust to outliers, and uses rank statistics only. We derive upper bounds on the convergence rate and propose independence tests too. We illustrate the theoretical contributions through a series of experiments in feature selection and low-dimensional embedding of distributions.

Submitted 18 June, 2012; originally announced June 2012.

Comments: ICML2012

arXiv:1201.0341 [pdf, ps, other]

doi 10.1007/978-3-642-28551-6_31

Collaborative Filtering via Group-Structured Dictionary Learning

Authors: Zoltan Szabo, Barnabas Poczos, Andras Lorincz

Abstract: Structured sparse coding and the related structured dictionary learning problems are novel research areas in machine learning. In this paper we present a new application of structured dictionary learning for collaborative filtering based recommender systems. Our extensive numerical experiments demonstrate that the presented technique outperforms its state-of-the-art competitors and has several advantages over approaches that do not put structured constraints on the dictionary elements.

Submitted 1 January, 2012; originally announced January 2012.

Comments: A compressed version of the paper has been accepted for publication at the 10th International Conference on Latent Variable Analysis and Source Separation (LVA/ICA 2012)

MSC Class: 65K10; 90C26; 49M37 (Primary) ACM Class: I.2.6; I.5.4

Journal ref: International Conference on Latent Variable Analysis and Source Separation (LVA/ICA), vol. 7191 of LNCS, pp. 247-254, 2012

arXiv:math/0703612 [pdf, other]

Independent Process Analysis without A Priori Dimensional Information

Authors: Barnabas Poczos, Zoltan Szabo, Melinda Kiszlinger, Andras Lorincz

Abstract: Recently, several algorithms have been proposed for independent subspace analysis where hidden variables are i.i.d. processes. We show that these methods can be extended to certain AR, MA, ARMA and ARIMA tasks. Central to our paper is that we introduce a cascade of algorithms, which aims to solve these tasks without previous knowledge about the number and the dimensions of the hidden processes. Our claim is supported by numerical simulations. As a particular application, we search for subspaces of facial components.

Submitted 20 March, 2007; originally announced March 2007.

Comments: 9 pages, 2 figures

Journal ref: ICA-2007, pages 252-259

arXiv:math/0701210 [pdf, ps, other]

Undercomplete Blind Subspace Deconvolution

Authors: Zoltan Szabo, Barnabas Poczos, Andras Lorincz

Abstract: We introduce the blind subspace deconvolution (BSSD) problem, which is the extension of both the blind source deconvolution (BSD) and the independent subspace analysis (ISA) tasks. We examine the case of the undercomplete BSSD (uBSSD). Applying temporal concatenation we reduce this problem to ISA. The associated `high dimensional' ISA problem can be handled by a recent technique called joint f-decorrelation (JFD). Similar decorrelation methods have been used previously for kernel independent component analysis (kernel-ICA). More precisely, the kernel canonical correlation (KCCA) technique is a member of this family, and, as is shown in this paper, the kernel generalized variance (KGV) method can also be seen as a decorrelation method in the feature space. These kernel based algorithms will be adapted to the ISA task. In the numerical examples, we (i) examine how efficiently the emerging higher dimensional ISA tasks can be tackled, and (ii) explore the working and advantages of the derived kernel-ISA methods.

Submitted 20 May, 2007; v1 submitted 7 January, 2007; originally announced January 2007.

Comments: Final version, appeared in Journal of Machine Learning Research

Journal ref: Zoltan Szabo, Barnabas Poczos, Andras Lorincz: Undercomplete Blind Subspace Deconvolution. Journal of Machine Learning Research 8(May):1063-1095, 2007

arXiv:math/0608100 [pdf, ps, other]

Separation Theorem for K-Independent Subspace Analysis with Sufficient Conditions

Authors: Zoltan Szabo, Barnabas Poczos, Andras Lorincz

Abstract: Here, a Separation Theorem about K-Independent Subspace Analysis (K real or complex), a generalization of K-Independent Component Analysis (KICA), is proven. According to the theorem, KISA estimation can be executed in two steps under certain conditions. In the first step, 1-dimensional KICA estimation is executed. In the second step, optimal permutation of the KICA elements is searched for. We present sufficient conditions for the KISA Separation Theorem. Namely, we shall show that (i) spherically symmetric sources (both for real and complex cases), as well as (ii) real 2-dimensional sources invariant to 90 degree rotation, among others, satisfy the conditions of the theorem.

Submitted 10 August, 2006; v1 submitted 3 August, 2006; originally announced August 2006.

Comments: Reference [13]: corrected

arXiv:math/0603535 [pdf, ps, other]

Separation Theorem for Independent Subspace Analysis with Sufficient Conditions

Authors: Zoltan Szabo, Barnabas Poczos, Andras Lorincz

Abstract: Here, a separation theorem about Independent Subspace Analysis (ISA), a generalization of Independent Component Analysis (ICA), is proven. According to the theorem, ISA estimation can be executed in two steps under certain conditions. In the first step, 1-dimensional ICA estimation is executed. In the second step, optimal permutation of the ICA elements is searched for. We present sufficient conditions for the ISA Separation Theorem. Namely, we shall show that (i) elliptically symmetric sources, (ii) 2-dimensional sources invariant to 90 degree rotation, among others, satisfy the conditions of the theorem.

Submitted 23 March, 2006; v1 submitted 22 March, 2006; originally announced March 2006.

Comments: 11 pages, 0 figures

MSC Class: 62G05; 62M10

Showing 1–38 of 38 results for author: Póczos, B