-
Langevin Diffusion Approximation to Same Marginal Schrödinger Bridge
Authors:
Medha Agarwal,
Zaid Harchaoui,
Garrett Mulcahy,
Soumik Pal
Abstract:
We introduce a novel approximation to the same marginal Schrödinger bridge using the Langevin diffusion. As $\varepsilon \downarrow 0$, it is known that the barycentric projection (also known as the entropic Brenier map) of the Schrödinger bridge converges to the Brenier map, which is the identity. Our diffusion approximation is leveraged to show that, under suitable assumptions, the difference between the two is $\varepsilon$ times the gradient of the marginal log density (i.e., the score function), in $\mathbf{L}^2$. More generally, we show that the family of Markov operators, indexed by $\varepsilon > 0$, derived from integrating test functions against the conditional density of the static Schrödinger bridge at temperature $\varepsilon$, admits a derivative at $\varepsilon=0$ given by the generator of the Langevin semigroup. Hence, these operators satisfy an approximate semigroup property at low temperatures.
Submitted 12 May, 2025;
originally announced May 2025.
-
Spectral Differential Network Analysis for High-Dimensional Time Series
Authors:
Michael Hellstern,
Byol Kim,
Zaid Harchaoui,
Ali Shojaie
Abstract:
Spectral networks derived from multivariate time series data arise in many domains, from brain science to Earth science. Often, it is of interest to study how these networks change under different conditions. For instance, to better understand epilepsy, it would be interesting to capture the changes in the brain connectivity network as a patient experiences a seizure, using electroencephalography data. A common approach relies on estimating the networks in each condition and calculating their difference. Such estimates may behave poorly in high dimensions as the networks themselves may not be sparse in structure while their difference may be. We build upon this observation to develop an estimator of the difference in inverse spectral densities across two conditions. Using an L1 penalty on the difference, consistency is established by only requiring the difference to be sparse. We illustrate the method on synthetic data experiments, on experiments with electroencephalography data, and on experiments with optogenetic stimulation and micro-electrocorticography data.
Submitted 10 December, 2024;
originally announced December 2024.
-
The Benefits of Balance: From Information Projections to Variance Reduction
Authors:
Lang Liu,
Ronak Mehta,
Soumik Pal,
Zaid Harchaoui
Abstract:
Data balancing across multiple modalities and sources appears in various forms in foundation models in machine learning and AI, e.g. in CLIP and DINO. We show that data balancing across modalities and sources actually offers an unsuspected benefit: variance reduction. We present a non-asymptotic statistical bound that quantifies this variance reduction effect and relates it to the eigenvalue decay of Markov operators. Furthermore, we describe how various forms of data balancing in contrastive multimodal learning and self-supervised clustering can be better understood, and even improved upon, owing to our variance reduction viewpoint.
Submitted 11 February, 2025; v1 submitted 27 August, 2024;
originally announced August 2024.
-
Iterated Schrödinger bridge approximation to Wasserstein Gradient Flows
Authors:
Medha Agarwal,
Zaid Harchaoui,
Garrett Mulcahy,
Soumik Pal
Abstract:
We introduce a novel discretization scheme for Wasserstein gradient flows that involves successively computing Schrödinger bridges with the same marginals. This is different from both the forward/geodesic approximation and the backward/Jordan-Kinderlehrer-Otto (JKO) approximations. The proposed scheme has two advantages: one, it avoids the use of the score function, and, two, it is amenable to particle-based approximations using the Sinkhorn algorithm. Our proof hinges upon showing that the relative entropy between the Schrödinger bridge with the same marginals at temperature $ε$ and the joint distribution of a stationary Langevin diffusion at times zero and $ε$ is of the order $o(ε^2)$ with an explicit dependence given by Fisher information. Owing to this inequality, we can show, using a triangular approximation argument, that the interpolated iterated application of the Schrödinger bridge approximation converges to the Wasserstein gradient flow, for a class of gradient flows, including the heat flow. The results also provide a probabilistic and rigorous framework for the convergence of the self-attention mechanisms in transformer networks to the solutions of heat flows, first observed in the inspiring work SABP22 in machine learning research.
Submitted 16 June, 2024;
originally announced June 2024.
-
Drago: Primal-Dual Coupled Variance Reduction for Faster Distributionally Robust Optimization
Authors:
Ronak Mehta,
Jelena Diakonikolas,
Zaid Harchaoui
Abstract:
We consider the penalized distributionally robust optimization (DRO) problem with a closed, convex uncertainty set, a setting that encompasses learning using $f$-DRO and spectral/$L$-risk minimization. We present Drago, a stochastic primal-dual algorithm that combines cyclic and randomized components with a carefully regularized primal update to achieve dual variance reduction. Owing to its design, Drago enjoys a state-of-the-art linear convergence rate on strongly convex-strongly concave DRO problems with a fine-grained dependency on primal and dual condition numbers. Theoretical results are supported by numerical benchmarks on regression and classification tasks.
Submitted 11 February, 2025; v1 submitted 15 March, 2024;
originally announced March 2024.
-
Distributionally Robust Optimization with Bias and Variance Reduction
Authors:
Ronak Mehta,
Vincent Roulet,
Krishna Pillutla,
Zaid Harchaoui
Abstract:
We consider the distributionally robust optimization (DRO) problem with spectral risk-based uncertainty set and $f$-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and average top-$k$ loss. We present Prospect, a stochastic gradient-based algorithm that only requires tuning a single learning rate hyperparameter, and prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3$\times$ faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.
Submitted 20 October, 2023;
originally announced October 2023.
-
Confidence Sets under Generalized Self-Concordance
Authors:
Lang Liu,
Zaid Harchaoui
Abstract:
This paper revisits a fundamental problem in statistical inference from a non-asymptotic theoretical viewpoint -- the construction of confidence sets. We establish a finite-sample bound for the estimator, characterizing its asymptotic behavior in a non-asymptotic fashion. An important feature of our bound is that its dimension dependency is captured by the effective dimension -- the trace of the limiting sandwich covariance -- which can be much smaller than the parameter dimension in some regimes. We then illustrate how the bound can be used to obtain a confidence set whose shape is adapted to the optimization landscape induced by the loss function. Unlike previous works that rely heavily on the strong convexity of the loss function, we only assume the Hessian is lower bounded at optimum and allow it to gradually become degenerate. This property is formalized by the notion of generalized self-concordance which originated from convex optimization. Moreover, we demonstrate how the effective dimension can be estimated from data and characterize its estimation accuracy. We apply our results to maximum likelihood estimation with generalized linear models, score matching with exponential families, and hypothesis testing with Rao's score test.
Submitted 31 December, 2022;
originally announced January 2023.
-
Stochastic Optimization for Spectral Risk Measures
Authors:
Ronak Mehta,
Vincent Roulet,
Krishna Pillutla,
Lang Liu,
Zaid Harchaoui
Abstract:
Spectral risk objectives - also called $L$-risks - allow for learning systems to interpolate between optimizing average-case performance (as in empirical risk minimization) and worst-case performance on a task. We develop stochastic algorithms to optimize these quantities by characterizing their subdifferential and addressing challenges such as biasedness of subgradient estimates and non-smoothness of the objective. We show theoretically and experimentally that out-of-the-box approaches such as stochastic subgradient and dual averaging are hindered by bias and that our approach outperforms them.
Submitted 9 December, 2022;
originally announced December 2022.
-
Statistical and Computational Guarantees for Influence Diagnostics
Authors:
Jillian Fisher,
Lang Liu,
Krishna Pillutla,
Yejin Choi,
Zaid Harchaoui
Abstract:
Influence diagnostics such as influence functions and approximate maximum influence perturbations are popular in machine learning and in AI domain applications. Influence diagnostics are powerful statistical tools to identify influential datapoints or subsets of datapoints. We establish finite-sample statistical bounds, as well as computational complexity bounds, for influence functions and approximate maximum influence perturbations using efficient inverse-Hessian-vector product implementations. We illustrate our results with generalized linear models and large attention-based models on synthetic and real data.
Submitted 19 September, 2023; v1 submitted 7 December, 2022;
originally announced December 2022.
-
Stochastic optimization on matrices and a graphon McKean-Vlasov limit
Authors:
Zaid Harchaoui,
Sewoong Oh,
Soumik Pal,
Raghav Somani,
Raghavendra Tripathi
Abstract:
We consider stochastic gradient descents on the space of large symmetric matrices of suitable functions that are invariant under permuting the rows and columns using the same permutation. We establish deterministic limits of these random curves as the dimensions of the matrices go to infinity while the entries remain bounded. Under a "small noise" assumption the limit is shown to be the gradient flow of functions on graphons whose existence was established in \cite{oh2021gradient}. We also consider limits of stochastic gradient descents with added properly scaled reflected Brownian noise. The limiting curve of graphons is characterized by a family of stochastic differential equations with reflections and can be thought of as an extension of the classical McKean-Vlasov limit for interacting diffusions to the graphon setting. The proofs introduce a family of infinite-dimensional exchangeable arrays of reflected diffusions and a novel notion of propagation of chaos for large matrices of diffusions converging to such arrays in a suitable sense.
Submitted 27 May, 2024; v1 submitted 2 October, 2022;
originally announced October 2022.
-
Orthogonal Statistical Learning with Self-Concordant Loss
Authors:
Lang Liu,
Carlos Cinelli,
Zaid Harchaoui
Abstract:
Orthogonal statistical learning and double machine learning have emerged as general frameworks for two-stage statistical prediction in the presence of a nuisance component. We establish non-asymptotic bounds on the excess risk of orthogonal statistical learning methods with a loss function satisfying a self-concordance property. Our bounds improve upon existing bounds by a dimension factor while lifting the assumption of strong convexity. We illustrate the results with examples from multiple treatment effect estimation and generalized partially linear modeling.
Submitted 19 June, 2022; v1 submitted 30 April, 2022;
originally announced May 2022.
-
Flat minima generalize for low-rank matrix recovery
Authors:
Lijun Ding,
Dmitriy Drusvyatskiy,
Maryam Fazel,
Zaid Harchaoui
Abstract:
Empirical evidence suggests that for a variety of overparameterized nonlinear models, most notably in neural network training, the growth of the loss around a minimizer strongly impacts its performance. Flat minima -- those around which the loss grows slowly -- appear to generalize well. This work takes a step towards understanding this phenomenon by focusing on the simplest class of overparameterized nonlinear models: those arising in low-rank matrix recovery. We analyze overparameterized matrix and bilinear sensing, robust PCA, covariance matrix estimation, and single hidden layer neural networks with quadratic activation functions. In all cases, we show that flat minima, measured by the trace of the Hessian, exactly recover the ground truth under standard statistical assumptions. For matrix completion, we establish weak recovery, although empirical evidence suggests exact recovery holds here as well. We conclude with synthetic experiments that illustrate our findings and discuss the effect of depth on flat solutions.
Submitted 17 February, 2023; v1 submitted 7 March, 2022;
originally announced March 2022.
-
Triangular Flows for Generative Modeling: Statistical Consistency, Smoothness Classes, and Fast Rates
Authors:
Nicholas J. Irons,
Meyer Scetbon,
Soumik Pal,
Zaid Harchaoui
Abstract:
Triangular flows, also known as Knöthe-Rosenblatt measure couplings, comprise an important building block of normalizing flow models for generative modeling and density estimation, including popular autoregressive flow models such as real-valued non-volume preserving transformation models (Real NVP). We present statistical guarantees and sample complexity bounds for triangular flow statistical models. In particular, we establish the statistical consistency and the finite sample convergence rates of the Kullback-Leibler estimator of the Knöthe-Rosenblatt measure coupling using tools from empirical process theory. Our results highlight the anisotropic geometry of function classes at play in triangular flows, shed light on optimal coordinate ordering, and lead to statistical guarantees for Jacobian flows. We conduct numerical experiments on synthetic data to illustrate the practical implications of our theoretical findings.
Submitted 31 December, 2021;
originally announced December 2021.
-
Entropy Regularized Optimal Transport Independence Criterion
Authors:
Lang Liu,
Soumik Pal,
Zaid Harchaoui
Abstract:
We introduce an independence criterion based on entropy regularized optimal transport. Our criterion can be used to test for independence between two samples. We establish non-asymptotic bounds for our test statistic and study its statistical behavior under both the null hypothesis and the alternative hypothesis. The theoretical results involve tools from U-process theory and optimal transport theory. We also offer a random feature type approximation for large-scale problems, as well as a differentiable program implementation for deep learning applications. We present experimental results on existing benchmarks for independence testing, illustrating the interest of the proposed criterion to capture both linear and nonlinear dependencies in synthetic data and real data.
Submitted 19 April, 2022; v1 submitted 30 December, 2021;
originally announced December 2021.
-
Federated Learning with Superquantile Aggregation for Heterogeneous Data
Authors:
Krishna Pillutla,
Yassine Laguel,
Jérôme Malick,
Zaid Harchaoui
Abstract:
We present a federated learning framework that is designed to robustly deliver good predictive performance across individual clients with heterogeneous data. The proposed approach hinges upon a superquantile-based learning objective that captures the tail statistics of the error distribution over heterogeneous clients. We present a stochastic training algorithm that interleaves differentially private client filtering with federated averaging steps. We prove finite time convergence guarantees for the algorithm: $O(1/\sqrt{T})$ in the nonconvex case in $T$ communication rounds and $O(\exp(-T/κ^{3/2}) + κ/T)$ in the strongly convex case with local condition number $κ$. Experimental results on benchmark datasets for federated learning demonstrate that our approach is competitive with classical ones in terms of average error and outperforms them in terms of tail statistics of the error.
Submitted 6 December, 2022; v1 submitted 17 December, 2021;
originally announced December 2021.
-
Score-Based Change Detection for Gradient-Based Learning Machines
Authors:
Lang Liu,
Joseph Salmon,
Zaid Harchaoui
Abstract:
The widespread use of machine learning algorithms calls for automatic change detection algorithms to monitor their behavior over time. As a machine learning algorithm learns from a continuous, possibly evolving, stream of data, it is desirable and often critical to supplement it with a companion change detection algorithm to facilitate its monitoring and control. We present a generic score-based change detection method that can detect a change in any number of components of a machine learning model trained via empirical risk minimization. This proposed statistical hypothesis test can be readily implemented for such models designed within a differentiable programming framework. We establish the consistency of the hypothesis test and show how to calibrate it to achieve a prescribed false alarm rate. We illustrate the versatility of the approach on synthetic and real data.
Submitted 26 June, 2021;
originally announced June 2021.
-
Divergence Frontiers for Generative Models: Sample Complexity, Quantization Effects, and Frontier Integrals
Authors:
Lang Liu,
Krishna Pillutla,
Sean Welleck,
Sewoong Oh,
Yejin Choi,
Zaid Harchaoui
Abstract:
The spectacular success of deep generative models calls for quantitative tools to measure their statistical performance. Divergence frontiers have recently been proposed as an evaluation framework for generative models, due to their ability to measure the quality-diversity trade-off inherent to deep generative modeling. We establish non-asymptotic bounds on the sample complexity of divergence frontiers. We also introduce frontier integrals which provide summary statistics of divergence frontiers. We show how smoothed estimators such as Good-Turing or Krichevsky-Trofimov can overcome the missing mass problem and lead to faster rates of convergence. We illustrate the theoretical results with numerical examples from natural language processing and computer vision.
Submitted 11 December, 2021; v1 submitted 15 June, 2021;
originally announced June 2021.
-
Differentiable Programming à la Moreau
Authors:
Vincent Roulet,
Zaid Harchaoui
Abstract:
The notion of a Moreau envelope is central to the analysis of first-order optimization algorithms for machine learning. Yet, it has not been developed and extended to be applied to a deep network and, more broadly, to a machine learning system with a differentiable programming implementation. We define a compositional calculus adapted to Moreau envelopes and show how to integrate it within differentiable programming. The proposed framework casts in a mathematical optimization framework several variants of gradient back-propagation related to the idea of the propagation of virtual targets.
Submitted 11 December, 2022; v1 submitted 31 December, 2020;
originally announced December 2020.
-
Faster Policy Learning with Continuous-Time Gradients
Authors:
Samuel Ainsworth,
Kendall Lowrey,
John Thickstun,
Zaid Harchaoui,
Siddhartha Srinivasa
Abstract:
We study the estimation of policy gradients for continuous-time systems with known dynamics. By reframing policy learning in continuous time, we show that it is possible to construct a more efficient and accurate gradient estimator. The standard back-propagation through time estimator (BPTT) computes exact gradients for a crude discretization of the continuous-time system. In contrast, we approximate continuous-time gradients in the original system. With the explicit goal of estimating continuous-time gradients, we are able to discretize adaptively and construct a more efficient policy gradient estimator which we call the Continuous-Time Policy Gradient (CTPG). We show that replacing BPTT policy gradients with more efficient CTPG estimates results in faster and more robust learning in a variety of control tasks and simulators.
Submitted 24 June, 2021; v1 submitted 11 December, 2020;
originally announced December 2020.
-
Asymptotics of Discrete Schrödinger Bridges via Chaos Decomposition
Authors:
Zaid Harchaoui,
Lang Liu,
Soumik Pal
Abstract:
Consider the problem of matching two independent i.i.d. samples of size $N$ from two distributions $P$ and $Q$ in $\mathbb{R}^d$. For an arbitrary continuous cost function, the optimal assignment problem looks for the matching that minimizes the total cost. We consider instead in this paper the problem where each matching is endowed with a Gibbs probability weight proportional to the exponential of the negative total cost of that matching. Viewing each matching as a joint distribution with $N$ atoms, we then take a convex combination with respect to the above Gibbs probability measure. We show that this resulting random joint distribution converges, as $N\rightarrow \infty$, to the solution of a variational problem, introduced by Föllmer, called the Schrödinger problem. We also derive the first two error terms of orders $N^{-1/2}$ and $N^{-1}$, respectively. This gives us central limit theorems for integrated test functions, including for the cost of transport, and second order Gaussian chaos limits when the limiting Gaussian variance is zero. The proofs are based on a novel chaos decomposition of the discrete Schrödinger bridge by polynomial functions of the pair of empirical distributions as the first and second order Taylor approximations in the space of measures. This is achieved by extending the Hoeffding decomposition from the classical theory of U-statistics.
Submitted 31 December, 2022; v1 submitted 17 November, 2020;
originally announced November 2020.
-
First-order Optimization for Superquantile-based Supervised Learning
Authors:
Yassine Laguel,
Jérôme Malick,
Zaid Harchaoui
Abstract:
Classical supervised learning via empirical risk (or negative log-likelihood) minimization hinges upon the assumption that the testing distribution coincides with the training distribution. This assumption can be challenged in modern applications of machine learning in which learning machines may operate at prediction time with testing data whose distribution departs from the one of the training data. We revisit the superquantile regression method by proposing a first-order optimization algorithm to minimize a superquantile-based learning objective. The proposed algorithm is based on smoothing the superquantile function by infimal convolution. Promising numerical results illustrate the interest of the approach towards safer supervised learning.
Submitted 1 October, 2020; v1 submitted 30 September, 2020;
originally announced September 2020.
-
Harmonic Decompositions of Convolutional Networks
Authors:
Meyer Scetbon,
Zaid Harchaoui
Abstract:
We present a description of the function space and the smoothness class associated with a convolutional network using the machinery of reproducing kernel Hilbert spaces. We show that the mapping associated with a convolutional network expands into a sum involving elementary functions akin to spherical harmonics. This functional decomposition can be related to the functional ANOVA decomposition in nonparametric statistics. Building off our functional characterization of convolutional networks, we obtain statistical bounds highlighting an interesting trade-off between the approximation error and the estimation error.
Submitted 16 November, 2020; v1 submitted 28 March, 2020;
originally announced March 2020.
-
A Spectral Analysis of Dot-product Kernels
Authors:
Meyer Scetbon,
Zaid Harchaoui
Abstract:
We present eigenvalue decay estimates of integral operators associated with compositional dot-product kernels. The estimates improve on previous ones established for power series kernels on spheres. This allows us to obtain the volumes of balls in the corresponding reproducing kernel Hilbert spaces. We discuss the consequences on statistical estimation with compositional dot product kernels and highlight interesting trade-offs between the approximation error and the statistical error depending on the number of compositions and the smoothness of the kernels.
Submitted 26 February, 2021; v1 submitted 28 February, 2020;
originally announced February 2020.
-
Device Heterogeneity in Federated Learning: A Superquantile Approach
Authors:
Yassine Laguel,
Krishna Pillutla,
Jérôme Malick,
Zaid Harchaoui
Abstract:
We propose a federated learning framework to handle heterogeneous client devices which do not conform to the population data distribution. The approach hinges upon a parameterized superquantile-based objective, where the parameter ranges over levels of conformity. We present an optimization algorithm and establish its convergence to a stationary point. We show how to practically implement it using secure aggregation by interleaving iterations of the usual federated averaging method with device filtering. We conclude with numerical experiments on neural networks as well as linear models on tasks from computer vision and natural language processing.
Submitted 25 February, 2020;
originally announced February 2020.
-
An Elementary Approach to Convergence Guarantees of Optimization Algorithms for Deep Networks
Authors:
Vincent Roulet,
Zaid Harchaoui
Abstract:
We present an approach to obtain convergence guarantees of optimization algorithms for deep networks based on elementary arguments and computations. The convergence analysis revolves around the analytical and computational structures of optimization oracles central to the implementation of deep networks in machine learning software. We provide a systematic way to compute estimates of the smoothness constants that govern the convergence behavior of first-order optimization algorithms used to train deep networks. A diverse set of example components and architectures arising in modern deep networks intersperse the exposition to illustrate the approach.
Submitted 29 December, 2020; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Robust Aggregation for Federated Learning
Authors:
Krishna Pillutla,
Sham M. Kakade,
Zaid Harchaoui
Abstract:
Federated learning is the centralized training of statistical models from decentralized data on mobile devices while preserving the privacy of each device. We present a robust aggregation approach to make federated learning robust in settings where a fraction of the devices may be sending corrupted updates to the server. The approach relies on a robust aggregation oracle based on the geometric median, which returns a robust aggregate using a constant number of iterations of a regular non-robust averaging oracle. The robust aggregation oracle is privacy-preserving, similar to the non-robust secure average oracle it builds upon. We establish its convergence for least squares estimation of additive models. We provide experimental results with linear models and deep networks for three tasks in computer vision and natural language processing. The robust aggregation approach is agnostic to the level of corruption; it outperforms the classical aggregation approach in terms of robustness when the level of corruption is high, while being competitive in the regime of low corruption. Two variants, a faster one with one-step robust aggregation and another one with on-device personalization, round off the paper.
Submitted 17 January, 2022; v1 submitted 31 December, 2019;
originally announced December 2019.
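The idea of obtaining a geometric median from repeated weighted averaging, as mentioned in the abstract above, can be sketched with a smoothed Weiszfeld-type iteration. This is a minimal illustration under stated assumptions: the function name `geometric_median`, the smoothing floor `nu`, and the iteration count are hypothetical choices, and the secure (privacy-preserving) averaging step of the paper is not modeled here.

```python
import math


def geometric_median(points, weights=None, num_iters=100, nu=1e-6):
    """Approximate the weighted geometric median of `points`.

    Each iteration is a plain weighted average of the points, with
    weights inversely proportional to the distance from the current
    estimate (floored at `nu` to avoid division by zero).
    """
    n = len(points)
    dim = len(points[0])
    if weights is None:
        weights = [1.0] * n

    # Start from the ordinary weighted mean (the non-robust aggregate).
    total = sum(weights)
    z = [sum(w * p[k] for w, p in zip(weights, points)) / total
         for k in range(dim)]

    for _ in range(num_iters):
        betas = [w / max(nu, math.dist(z, p))
                 for w, p in zip(weights, points)]
        beta_sum = sum(betas)
        z = [sum(b * p[k] for b, p in zip(betas, points)) / beta_sum
             for k in range(dim)]
    return z
```

With three updates at the origin and one outlier at `(10, 10)`, the ordinary mean is `(2.5, 2.5)`, whereas this iteration settles very close to the origin, which illustrates the robustness to a minority of corrupted updates.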
-
Discriminative Clustering with Representation Learning with any Ratio of Labeled to Unlabeled Data
Authors:
Corinne Jones,
Vincent Roulet,
Zaid Harchaoui
Abstract:
We present a discriminative clustering approach in which the feature representation can be learned from data and which can moreover leverage labeled data. Representation learning can give a similarity-based clustering method the ability to automatically adapt to an underlying, yet hidden, geometric structure of the data. The proposed approach augments the DIFFRAC method with a representation learning capability, using a gradient-based stochastic training algorithm and an optimal transport algorithm with entropic regularization to perform the cluster assignment step. The resulting method is evaluated on several real datasets when varying the ratio of labeled data to unlabeled data and thereby interpolating between the fully unsupervised regime and the fully supervised regime. The experimental results suggest that the proposed method can learn powerful feature representations even in the fully unsupervised regime and can leverage even small amounts of labeled data to improve the feature representations and to obtain better clusterings of complex datasets.
Submitted 17 February, 2023; v1 submitted 30 December, 2019;
originally announced December 2019.
-
Advances and Open Problems in Federated Learning
Authors:
Peter Kairouz,
H. Brendan McMahan,
Brendan Avent,
Aurélien Bellet,
Mehdi Bennis,
Arjun Nitin Bhagoji,
Kallista Bonawitz,
Zachary Charles,
Graham Cormode,
Rachel Cummings,
Rafael G. L. D'Oliveira,
Hubert Eichner,
Salim El Rouayheb,
David Evans,
Josh Gardner,
Zachary Garrett,
Adrià Gascón,
Badih Ghazi,
Phillip B. Gibbons,
Marco Gruteser,
Zaid Harchaoui,
Chaoyang He,
Lie He,
Zhouyuan Huo,
Ben Hutchinson,
et al. (34 additional authors not shown)
Abstract:
Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Submitted 8 March, 2021; v1 submitted 10 December, 2019;
originally announced December 2019.
-
A Statistical Investigation of Long Memory in Language and Music
Authors:
Alexander Greaves-Tunnell,
Zaid Harchaoui
Abstract:
Representation and learning of long-range dependencies is a central challenge confronted in modern applications of machine learning to sequence data. Yet despite the prominence of this issue, the basic problem of measuring long-range dependence, either in a given data source or as represented in a trained deep model, remains largely limited to heuristic tools. We contribute a statistical framework for investigating long-range dependence in current applications of deep sequence modeling, drawing on the well-developed theory of long memory stochastic processes. This framework yields testable implications concerning the relationship between long memory in real-world data and its learned representation in a deep learning architecture, which are explored through a semiparametric framework adapted to the high-dimensional setting.
Submitted 6 June, 2019; v1 submitted 8 April, 2019;
originally announced April 2019.
-
Kernel-based Translations of Convolutional Networks
Authors:
Corinne Jones,
Vincent Roulet,
Zaid Harchaoui
Abstract:
Convolutional Neural Networks, like most artificial neural networks, are commonly viewed as methods different in essence from kernel-based methods. We provide a systematic translation of Convolutional Neural Networks (ConvNets) into their kernel-based counterparts, Convolutional Kernel Networks (CKNs), and demonstrate that this perception is unfounded both formally and empirically. We show that, given a Convolutional Neural Network, we can design a corresponding Convolutional Kernel Network, easily trainable using a new stochastic gradient algorithm based on an accurate gradient computation, that performs on par with its Convolutional Neural Network counterpart. We present experimental results supporting our claims on landmark ConvNet architectures, comparing each ConvNet to its CKN counterpart over several parameter settings.
Submitted 19 March, 2019;
originally announced March 2019.
-
A Smoother Way to Train Structured Prediction Models
Authors:
Krishna Pillutla,
Vincent Roulet,
Sham M. Kakade,
Zaid Harchaoui
Abstract:
We present a framework to train a structured prediction model by performing smoothing on the inference algorithm it builds upon. Smoothing overcomes the non-smoothness inherent to the maximum margin structured prediction objective, and paves the way for the use of fast primal gradient-based optimization algorithms. We illustrate the proposed framework by developing a novel primal incremental optimization algorithm for the structural support vector machine. The proposed algorithm blends an extrapolation scheme for acceleration and an adaptive smoothing scheme and builds upon the stochastic variance-reduced gradient algorithm. We establish its worst-case global complexity bound and study several practical variants, including extensions to deep structured prediction. We present experimental results on two real-world problems, namely named entity recognition and visual object localization. The experimental results show that the proposed framework allows us to build upon efficient inference algorithms to develop large-scale optimization algorithms for structured prediction which can achieve competitive performance on the two real-world problems.
Submitted 8 February, 2019;
originally announced February 2019.
-
Coupled Recurrent Models for Polyphonic Music Composition
Authors:
John Thickstun,
Zaid Harchaoui,
Dean P. Foster,
Sham M. Kakade
Abstract:
This paper introduces a novel recurrent model for music composition that is tailored to the structure of polyphonic music. We propose an efficient new conditional probabilistic factorization of musical scores, viewing a score as a collection of concurrent, coupled sequences: i.e. voices. To model the conditional distributions, we borrow ideas from both convolutional and recurrent neural models; we argue that these ideas are natural for capturing music's pitch invariances, temporal structure, and polyphony. We train models for single-voice and multi-voice composition on 2,300 scores from the KernScores dataset.
Submitted 26 November, 2019; v1 submitted 19 November, 2018;
originally announced November 2018.
-
Adaptive Denoising of Signals with Local Shift-Invariant Structure
Authors:
Zaid Harchaoui,
Anatoli Juditsky,
Arkadi Nemirovski,
Dmitrii Ostrovskii
Abstract:
We discuss the problem of adaptive discrete-time signal denoising in the situation where the signal to be recovered admits a "linear oracle" -- an unknown linear estimate that takes the form of convolution of observations with a time-invariant filter. It was shown by Juditsky and Nemirovski (2009) that when the $\ell_2$-norm of the oracle filter is small enough, such an oracle can be "mimicked" by an efficiently computable adaptive estimate of the same structure with an observation-driven filter. The filter in question was obtained as a solution to the optimization problem in which the $\ell_\infty$-norm of the Discrete Fourier Transform (DFT) of the estimation residual is minimized under a constraint on the $\ell_1$-norm of the filter DFT. In this paper, we discuss a new family of adaptive estimates which rely upon minimizing the $\ell_2$-norm of the estimation residual. We show that such estimators possess better statistical properties than those based on $\ell_\infty$-fit; in particular, we prove oracle inequalities for their $\ell_2$-loss and improved bounds for $\ell_2$- and pointwise losses. The oracle inequalities rely on the "approximate shift-invariance" assumption stating that the signal to be recovered is close to an (unknown) shift-invariant subspace. We also study the relationship of the approximate shift-invariance assumption with the "signal simplicity" assumption introduced in Juditsky and Nemirovski (2009) and discuss the application of the proposed approach to harmonic oscillations denoising.
Submitted 11 February, 2021; v1 submitted 11 June, 2018;
originally announced June 2018.
-
Efficient First-Order Algorithms for Adaptive Signal Denoising
Authors:
Dmitrii Ostrovskii,
Zaid Harchaoui
Abstract:
We consider the problem of discrete-time signal denoising, focusing on a specific family of non-linear convolution-type estimators. Each such estimator is associated with a time-invariant filter which is obtained adaptively, by solving a certain convex optimization problem. Adaptive convolution-type estimators were demonstrated to have favorable statistical properties. However, the question of their computational complexity remains largely unexplored, and in fact we are not aware of any publicly available implementation of these estimators. Our first contribution is an efficient implementation of these estimators via some known first-order proximal algorithms. Our second contribution is a computational complexity analysis of the proposed procedures, which takes into account their statistical nature and the related notion of statistical accuracy. The proposed procedures and their analysis are illustrated on a simulated data benchmark.
Submitted 12 June, 2018; v1 submitted 29 March, 2018;
originally announced March 2018.
-
Catalyst Acceleration for First-order Convex Optimization: from Theory to Practice
Authors:
Hongzhou Lin,
Julien Mairal,
Zaid Harchaoui
Abstract:
We introduce a generic scheme for accelerating gradient-based optimization methods in the sense of Nesterov. The approach, called Catalyst, builds upon the inexact accelerated proximal point algorithm for minimizing a convex objective function, and consists of approximately solving a sequence of well-chosen auxiliary problems, leading to faster convergence. One of the keys to achieving acceleration in theory and in practice is to solve these sub-problems with appropriate accuracy by using the right stopping criterion and the right warm-start strategy. We give practical guidelines for using Catalyst and present a comprehensive analysis of its global complexity. We show that Catalyst applies to a large class of algorithms, including gradient descent, block coordinate descent, incremental algorithms such as SAG, SAGA, SDCA, SVRG, MISO/Finito, and their proximal variants. For all of these methods, we establish faster rates using the Catalyst acceleration, for strongly convex and non-strongly convex objectives. We conclude with extensive experiments showing that acceleration is useful in practice, especially for ill-conditioned problems.
Submitted 19 June, 2018; v1 submitted 15 December, 2017;
originally announced December 2017.
-
Invariances and Data Augmentation for Supervised Music Transcription
Authors:
John Thickstun,
Zaid Harchaoui,
Dean Foster,
Sham M. Kakade
Abstract:
This paper explores a variety of models for frame-based music transcription, with an emphasis on the methods needed to reach state-of-the-art on human recordings. The translation-invariant network discussed in this paper, which combines a traditional filterbank with a convolutional neural network, was the top-performing model in the 2017 MIREX Multiple Fundamental Frequency Estimation evaluation. This class of models shares parameters in the log-frequency domain, which exploits the frequency invariance of music to reduce the number of model parameters and avoid overfitting to the training data. All models in this paper were trained with supervision by labeled data from the MusicNet dataset, augmented by random label-preserving pitch-shift transformations.
Submitted 13 November, 2017;
originally announced November 2017.
-
Catalyst Acceleration for Gradient-Based Non-Convex Optimization
Authors:
Courtney Paquette,
Hongzhou Lin,
Dmitriy Drusvyatskiy,
Julien Mairal,
Zaid Harchaoui
Abstract:
We introduce a generic scheme to solve nonconvex optimization problems using gradient-based algorithms originally designed for minimizing convex functions. Even though these methods may originally require convexity to operate, the proposed approach allows one to use them on weakly convex objectives, which covers a large class of non-convex functions typically appearing in machine learning and signal processing. In general, the scheme is guaranteed to produce a stationary point with a worst-case efficiency typical of first-order methods, and when the objective turns out to be convex, it automatically accelerates in the sense of Nesterov and achieves a near-optimal convergence rate in function values. These properties are achieved without assuming any knowledge about the convexity of the objective, by automatically adapting to the unknown weak convexity constant. We conclude the paper by showing promising experimental results obtained by applying our approach to incremental algorithms such as SVRG and SAGA for sparse matrix factorization and for learning neural networks.
Submitted 31 December, 2018; v1 submitted 31 March, 2017;
originally announced March 2017.
-
Learning Features of Music from Scratch
Authors:
John Thickstun,
Zaid Harchaoui,
Sham Kakade
Abstract:
This paper introduces a new large-scale music dataset, MusicNet, to serve as a source of supervision and evaluation of machine learning methods for music research. MusicNet consists of hundreds of freely-licensed classical music recordings by 10 composers, written for 11 instruments, together with instrument/note annotations resulting in over 1 million temporal labels on 34 hours of chamber music performances under various studio and microphone conditions.
The paper defines a multi-label classification task to predict notes in musical recordings, along with an evaluation protocol, and benchmarks several machine learning architectures for this task: i) learning from spectrogram features; ii) end-to-end learning with a neural net; iii) end-to-end learning with a convolutional neural net. These experiments show that end-to-end models trained for note prediction learn frequency selective filters as a low-level representation of audio.
Submitted 5 April, 2017; v1 submitted 29 November, 2016;
originally announced November 2016.
-
An Inexact Variable Metric Proximal Point Algorithm for Generic Quasi-Newton Acceleration
Authors:
Hongzhou Lin,
Julien Mairal,
Zaid Harchaoui
Abstract:
We propose an inexact variable-metric proximal point algorithm to accelerate gradient-based optimization algorithms. The proposed scheme, called QNing, can notably be applied to incremental first-order methods such as the stochastic variance-reduced gradient descent algorithm (SVRG) and other randomized incremental optimization algorithms. QNing is also compatible with composite objectives, meaning that it has the ability to provide exactly sparse solutions when the objective involves a sparsity-inducing regularization. When combined with limited-memory BFGS rules, QNing is particularly effective at solving high-dimensional optimization problems, while enjoying a worst-case linear convergence rate for strongly convex problems. We present experimental results where QNing gives significant improvements over competing methods for training machine learning methods on large samples and in high dimensions.
Submitted 29 January, 2019; v1 submitted 4 October, 2016;
originally announced October 2016.
-
Fast and Simple Optimization for Poisson Likelihood Models
Authors:
Niao He,
Zaid Harchaoui,
Yichen Wang,
Le Song
Abstract:
Poisson likelihood models have been prevalently used in imaging, social networks, and time series analysis. We propose fast, simple, theoretically grounded, and versatile optimization algorithms for Poisson likelihood modeling. The Poisson log-likelihood is concave but not Lipschitz-continuous. Since almost all gradient-based optimization algorithms rely on Lipschitz-continuity, optimizing Poisson likelihood models with a guarantee of convergence can be challenging, especially for large-scale problems.
We present a new perspective that allows one to efficiently optimize a wide range of penalized Poisson likelihood objectives. We show that an appropriate saddle point reformulation enjoys a favorable geometry and a smooth structure. Therefore, we can design a new gradient-based optimization algorithm with $O(1/t)$ convergence rate, in contrast to the usual $O(1/\sqrt{t})$ rate of non-smooth minimization alternatives. Furthermore, in order to tackle problems with large samples, we also develop a randomized block-decomposition variant that enjoys the same convergence rate with a more efficient iteration cost.
Experimental results on several point process applications including social network estimation and temporal recommendation show that the proposed algorithm and its randomized block variant outperform existing methods both on synthetic and real-world datasets.
Submitted 3 August, 2016;
originally announced August 2016.
-
Rademacher Complexity Bounds for a Penalized Multiclass Semi-Supervised Algorithm
Authors:
Yury Maximov,
Massih-Reza Amini,
Zaid Harchaoui
Abstract:
We propose Rademacher complexity bounds for multiclass classifiers trained with a two-step semi-supervised model. In the first step, the algorithm partitions the partially labeled data and then identifies dense clusters containing $\kappa$ predominant classes using the labeled training examples such that the proportion of their non-predominant classes is below a fixed threshold. In the second step, a classifier is trained by minimizing a margin empirical loss over the labeled training set and a penalization term measuring the inability of the learner to predict the $\kappa$ predominant classes of the identified clusters. The resulting data-dependent generalization error bound involves the margin distribution of the classifier, the stability of the clustering technique used in the first step and Rademacher complexity terms corresponding to partially labeled training data. Our theoretical results exhibit convergence rates extending those proposed in the literature for the binary case, and experimental results on different multiclass classification problems show empirical evidence that supports the theory.
Submitted 25 January, 2018; v1 submitted 2 July, 2016;
originally announced July 2016.
-
Convolutional Kernel Networks
Authors:
Julien Mairal,
Piotr Koniusz,
Zaid Harchaoui,
Cordelia Schmid
Abstract:
An important goal in visual recognition is to devise image representations that are invariant to particular transformations. In this paper, we address this goal with a new type of convolutional neural network (CNN) whose invariance is encoded by a reproducing kernel. Unlike traditional approaches where neural networks are learned either to represent data or for solving a classification task, our network learns to approximate the kernel feature map on training data. Such an approach enjoys several benefits over classical ones. First, by teaching CNNs to be invariant, we obtain simple network architectures that achieve a similar accuracy to more complex ones, while being easy to train and robust to overfitting. Second, we bridge a gap between the neural network literature and kernels, which are natural tools to model invariance. We evaluate our methodology on visual recognition tasks where CNNs have proven to perform well, e.g., digit recognition with the MNIST dataset, and the more challenging CIFAR-10 and STL-10 datasets, where our accuracy is competitive with the state of the art.
Submitted 14 November, 2014; v1 submitted 12 June, 2014;
originally announced June 2014.
-
Fast and Robust Archetypal Analysis for Representation Learning
Authors:
Yuansi Chen,
Julien Mairal,
Zaid Harchaoui
Abstract:
We revisit a pioneering unsupervised learning technique called archetypal analysis, which is related to successful data analysis methods such as sparse coding and non-negative matrix factorization. Since it was proposed, archetypal analysis has not gained much popularity, even though it produces more interpretable models than other alternatives. Because no efficient implementation has ever been made publicly available, its application to important scientific problems may have been severely limited. Our goal is to bring archetypal analysis back into favour. We propose a fast optimization scheme using an active-set strategy, and provide an efficient open-source implementation interfaced with Matlab, R, and Python. Then, we demonstrate the usefulness of archetypal analysis for computer vision tasks, such as codebook learning, signal classification, and large image collection visualization.
Submitted 26 May, 2014;
originally announced May 2014.
-
Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization
Authors:
Zaid Harchaoui,
Anatoli Juditsky,
Arkadi Nemirovski
Abstract:
Motivated by some applications in signal processing and machine learning, we consider two convex optimization problems where, given a cone $K$, a norm $\|\cdot\|$ and a smooth convex function $f$, we want either 1) to minimize the norm over the intersection of the cone and a level set of $f$, or 2) to minimize over the cone the sum of $f$ and a multiple of the norm. We focus on the case where (a) the dimension of the problem is too large to allow for interior point algorithms, and (b) $\|\cdot\|$ is "too complicated" to allow for the computationally cheap Bregman projections required in first-order proximal gradient algorithms. On the other hand, we assume that it is relatively easy to minimize linear forms over the intersection of $K$ and the unit $\|\cdot\|$-ball. Motivating examples are given by the nuclear norm with $K$ being the entire space of matrices, or the positive semidefinite cone in the space of symmetric matrices, and the Total Variation norm on the space of 2D images. We discuss versions of the Conditional Gradient algorithm capable of handling our problems of interest, provide the related theoretical efficiency estimates, and outline some applications.
Submitted 28 March, 2013; v1 submitted 10 February, 2013;
originally announced February 2013.
-
Testing for Homogeneity with Kernel Fisher Discriminant Analysis
Authors:
Zaid Harchaoui,
Francis Bach,
Eric Moulines
Abstract:
We propose to investigate test statistics for testing homogeneity in reproducing kernel Hilbert spaces. Asymptotic null distributions under the null hypothesis are derived, and consistency against fixed and local alternatives is assessed. Finally, experimental evidence of the performance of the proposed approach on both artificial data and a speaker verification task is provided.
Submitted 7 April, 2008;
originally announced April 2008.