-
Gaussian-Smoothed Sliced Probability Divergences
Authors:
Mokhtar Z. Alaya,
Alain Rakotomamonjy,
Maxime Berar,
Gilles Gasso
Abstract:
Gaussian smoothed sliced Wasserstein distance has been recently introduced for comparing probability distributions, while preserving privacy on the data. It has been shown that it provides performances similar to its non-smoothed (non-private) counterpart. However, the computationaland statistical properties of such a metric have not yet been well-established. This work investigates the theoretica…
▽ More
Gaussian smoothed sliced Wasserstein distance has been recently introduced for comparing probability distributions, while preserving privacy on the data. It has been shown that it provides performances similar to its non-smoothed (non-private) counterpart. However, the computationaland statistical properties of such a metric have not yet been well-established. This work investigates the theoretical properties of this distance as well as those of generalized versions denoted as Gaussian-smoothed sliced divergences. We first show that smoothing and slicing preserve the metric property and the weak topology. To study the sample complexity of such divergences, we then introduce $\hat{\hatμ}_{n}$ the double empirical distribution for the smoothed-projected $μ$. The distribution $\hat{\hatμ}_{n}$ is a result of a double sampling process: one from sampling according to the origin distribution $μ$ and the second according to the convolution of the projection of $μ$ on the unit sphere and the Gaussian smoothing. We particularly focus on the Gaussian smoothed sliced Wasserstein distance and prove that it converges with a rate $O(n^{-1/2})$. We also derive other properties, including continuity, of different divergences with respect to the smoothing parameter. We support our theoretical findings with empirical studies in the context of privacy-preserving domain adaptation.
△ Less
Submitted 25 April, 2024; v1 submitted 4 April, 2024;
originally announced April 2024.
-
Differentially Private Gradient Flow based on the Sliced Wasserstein Distance
Authors:
Ilana Sebag,
Muni Sreenivas Pydi,
Jean-Yves Franceschi,
Alain Rakotomamonjy,
Mike Gartrell,
Jamal Atif,
Alexandre Allauzen
Abstract:
Safeguarding privacy in sensitive training data is paramount, particularly in the context of generative modeling. This can be achieved through either differentially private stochastic gradient descent or a differentially private metric for training models or generators. In this paper, we introduce a novel differentially private generative modeling approach based on a gradient flow in the space of…
▽ More
Safeguarding privacy in sensitive training data is paramount, particularly in the context of generative modeling. This can be achieved through either differentially private stochastic gradient descent or a differentially private metric for training models or generators. In this paper, we introduce a novel differentially private generative modeling approach based on a gradient flow in the space of probability measures. To this end, we define the gradient flow of the Gaussian-smoothed Sliced Wasserstein Distance, including the associated stochastic differential equation (SDE). By discretizing and defining a numerical scheme for solving this SDE, we demonstrate the link between smoothing and differential privacy based on a Gaussian mechanism, due to a specific form of the SDE's drift term. We then analyze the differential privacy guarantee of our gradient flow, which accounts for both the smoothing and the Wiener process introduced by the SDE itself. Experiments show that our proposed model can generate higher-fidelity data at a low privacy budget compared to a generator-based model, offering a promising alternative.
△ Less
Submitted 20 January, 2025; v1 submitted 13 December, 2023;
originally announced December 2023.
-
Unifying GANs and Score-Based Diffusion as Generative Particle Models
Authors:
Jean-Yves Franceschi,
Mike Gartrell,
Ludovic Dos Santos,
Thibaut Issenhuth,
Emmanuel de Bézenac,
Mickaël Chen,
Alain Rakotomamonjy
Abstract:
Particle-based deep generative models, such as gradient flows and score-based diffusion models, have recently gained traction thanks to their striking performance. Their principle of displacing particle distributions using differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which involve training a pushforward generator netw…
▽ More
Particle-based deep generative models, such as gradient flows and score-based diffusion models, have recently gained traction thanks to their striking performance. Their principle of displacing particle distributions using differential equations is conventionally seen as opposed to the previously widespread generative adversarial networks (GANs), which involve training a pushforward generator network. In this paper we challenge this interpretation, and propose a novel framework that unifies particle and adversarial generative models by framing generator training as a generalization of particle models. This suggests that a generator is an optional addition to any such generative model. Consequently, integrating a generator into a score-based diffusion model and training a GAN without a generator naturally emerge from our framework. We empirically test the viability of these original models as proofs of concepts of potential applications of our framework.
△ Less
Submitted 21 December, 2023; v1 submitted 25 May, 2023;
originally announced May 2023.
-
Sliced-Wasserstein on Symmetric Positive Definite Matrices for M/EEG Signals
Authors:
Clément Bonet,
Benoît Malézieux,
Alain Rakotomamonjy,
Lucas Drumetz,
Thomas Moreau,
Matthieu Kowalski,
Nicolas Courty
Abstract:
When dealing with electro or magnetoencephalography records, many supervised prediction tasks are solved by working with covariance matrices to summarize the signals. Learning with these matrices requires using Riemanian geometry to account for their structure. In this paper, we propose a new method to deal with distributions of covariance matrices and demonstrate its computational efficiency on M…
▽ More
When dealing with electro or magnetoencephalography records, many supervised prediction tasks are solved by working with covariance matrices to summarize the signals. Learning with these matrices requires using Riemanian geometry to account for their structure. In this paper, we propose a new method to deal with distributions of covariance matrices and demonstrate its computational efficiency on M/EEG multivariate time series. More specifically, we define a Sliced-Wasserstein distance between measures of symmetric positive definite matrices that comes with strong theoretical guarantees. Then, we take advantage of its properties and kernel methods to apply this distance to brain-age prediction from MEG data and compare it to state-of-the-art algorithms based on Riemannian geometry. Finally, we show that it is an efficient surrogate to the Wasserstein distance in domain adaptation for Brain Computer Interface applications.
△ Less
Submitted 24 May, 2023; v1 submitted 10 March, 2023;
originally announced March 2023.
-
Personalised Federated Learning On Heterogeneous Feature Spaces
Authors:
Alain Rakotomamonjy,
Maxime Vono,
Hamlet Jesse Medina Ruiz,
Liva Ralaivola
Abstract:
Most personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common subspace i.e. all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use heterogeneous data representations. We aim at filling this gap. To this end,…
▽ More
Most personalised federated learning (FL) approaches assume that raw data of all clients are defined in a common subspace i.e. all clients store their data according to the same schema. For real-world applications, this assumption is restrictive as clients, having their own systems to collect and then store data, may use heterogeneous data representations. We aim at filling this gap. To this end, we propose a general framework coined FLIC that maps client's data onto a common feature space via local embedding functions. The common feature space is learnt in a federated manner using Wasserstein barycenters while the local embedding functions are trained on each client via distribution alignment. We integrate this distribution alignement mechanism into a federated learning approach and provide the algorithmics of FLIC. We compare its performances against FL benchmarks involving heterogeneous input features spaces. In addition, we provide theoretical insights supporting the relevance of our methodology.
△ Less
Submitted 26 January, 2023;
originally announced January 2023.
-
Continuous PDE Dynamics Forecasting with Implicit Neural Representations
Authors:
Yuan Yin,
Matthieu Kirchmeyer,
Jean-Yves Franceschi,
Alain Rakotomamonjy,
Patrick Gallinari
Abstract:
Effective data-driven PDE forecasting methods often rely on fixed spatial and / or temporal discretizations. This raises limitations in real-world applications like weather prediction where flexible extrapolation at arbitrary spatiotemporal locations is required. We address this problem by introducing a new data-driven approach, DINo, that models a PDE's flow with continuous-time dynamics of spati…
▽ More
Effective data-driven PDE forecasting methods often rely on fixed spatial and / or temporal discretizations. This raises limitations in real-world applications like weather prediction where flexible extrapolation at arbitrary spatiotemporal locations is required. We address this problem by introducing a new data-driven approach, DINo, that models a PDE's flow with continuous-time dynamics of spatially continuous functions. This is achieved by embedding spatial observations independently of their discretization via Implicit Neural Representations in a small latent space temporally driven by a learned ODE. This separate and flexible treatment of time and space makes DINo the first data-driven model to combine the following advantages. It extrapolates at arbitrary spatial and temporal locations; it can learn from sparse irregular grids or manifolds; at test time, it generalizes to new grids or resolutions. DINo outperforms alternative neural PDE forecasters in a variety of challenging generalization scenarios on representative PDE systems.
△ Less
Submitted 15 February, 2023; v1 submitted 29 September, 2022;
originally announced September 2022.
-
Benchopt: Reproducible, efficient and collaborative optimization benchmarks
Authors:
Thomas Moreau,
Mathurin Massias,
Alexandre Gramfort,
Pierre Ablin,
Pierre-Antoine Bannier,
Benjamin Charlier,
Mathieu Dagréou,
Tom Dupré la Tour,
Ghislain Durif,
Cassio F. Dantas,
Quentin Klopfenstein,
Johan Larsson,
En Lai,
Tanguy Lefort,
Benoit Malézieux,
Badr Moufad,
Binh T. Nguyen,
Alain Rakotomamonjy,
Zaccharie Ramzi,
Joseph Salmon,
Samuel Vaiter
Abstract:
Numerical validation is at the core of machine learning research as it allows to assess the actual impact of new methods, and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, as well as tedious re-implementat…
▽ More
Numerical validation is at the core of machine learning research as it allows to assess the actual impact of new methods, and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, as well as tedious re-implementation work. As a result, validation is often very partial, which can lead to wrong conclusions that slow down the progress of research. We propose Benchopt, a collaborative framework to automate, reproduce and publish optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: $\ell_2$-regularized logistic regression, Lasso, and ResNet18 training for image classification. These benchmarks highlight key practical findings that give a more nuanced view of the state-of-the-art for these problems, showing that for practical evaluation, the devil is in the details. We hope that Benchopt will foster collaborative work in the community hence improving the reproducibility of research findings.
△ Less
Submitted 28 October, 2022; v1 submitted 27 June, 2022;
originally announced June 2022.
-
Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances
Authors:
Ruben Ohana,
Kimia Nadjahi,
Alain Rakotomamonjy,
Liva Ralaivola
Abstract:
The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties -- or, more accurately, its generalization properties -- with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-B…
▽ More
The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties -- or, more accurately, its generalization properties -- with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and a central observation that SW may be interpreted as an average risk, the quantity PAC-Bayesian bounds have been designed to characterize. We provide three types of results: i) PAC-Bayesian generalization bounds that hold on what we refer as adaptive Sliced-Wasserstein distances, i.e. SW defined with respect to arbitrary distributions of slices (among which data-dependent distributions), ii) a principled procedure to learn the distribution of slices that yields maximally discriminative SW, by optimizing our theoretical bounds, and iii) empirical illustrations of our theoretical findings.
△ Less
Submitted 31 May, 2023; v1 submitted 7 June, 2022;
originally announced June 2022.
-
Generalizing to New Physical Systems via Context-Informed Dynamics Model
Authors:
Matthieu Kirchmeyer,
Yuan Yin,
Jérémie Donà,
Nicolas Baskiotis,
Alain Rakotomamonjy,
Patrick Gallinari
Abstract:
Data-driven approaches to modeling physical systems fail to generalize to unseen systems that share the same general dynamics with the learning domain, but correspond to different physical contexts. We propose a new framework for this key problem, context-informed dynamics adaptation (CoDA), which takes into account the distributional shift across systems for fast and efficient adaptation to new d…
▽ More
Data-driven approaches to modeling physical systems fail to generalize to unseen systems that share the same general dynamics with the learning domain, but correspond to different physical contexts. We propose a new framework for this key problem, context-informed dynamics adaptation (CoDA), which takes into account the distributional shift across systems for fast and efficient adaptation to new dynamics. CoDA leverages multiple environments, each associated to a different dynamic, and learns to condition the dynamics model on contextual parameters, specific to each environment. The conditioning is performed via a hypernetwork, learned jointly with a context vector from observed data. The proposed formulation constrains the search hypothesis space to foster fast adaptation and better generalization across environments. We theoretically motivate our approach and show state-of-the-art generalization results on a set of nonlinear dynamics, representative of a variety of application domains. We also show, on these systems, that new system parameters can be inferred from context vectors with minimal supervision. Code is available at https://github.com/yuan-yin/CoDA .
△ Less
Submitted 24 June, 2022; v1 submitted 1 February, 2022;
originally announced February 2022.
-
Differentially Private Sliced Wasserstein Distance
Authors:
Alain Rakotomamonjy,
Liva Ralaivola
Abstract:
Developing machine learning methods that are privacy preserving is today a central topic of research, with huge practical impacts. Among the numerous ways to address privacy-preserving learning, we here take the perspective of computing the divergences between distributions under the Differential Privacy (DP) framework -- being able to compute divergences between distributions is pivotal for many…
▽ More
Developing machine learning methods that are privacy preserving is today a central topic of research, with huge practical impacts. Among the numerous ways to address privacy-preserving learning, we here take the perspective of computing the divergences between distributions under the Differential Privacy (DP) framework -- being able to compute divergences between distributions is pivotal for many machine learning problems, such as learning generative models or domain adaptation problems. Instead of resorting to the popular gradient-based sanitization method for DP, we tackle the problem at its roots by focusing on the Sliced Wasserstein Distance and seamlessly making it differentially private. Our main contribution is as follows: we analyze the property of adding a Gaussian perturbation to the intrinsic randomized mechanism of the Sliced Wasserstein Distance, and we establish the sensitivityof the resulting differentially private mechanism. One of our important findings is that this DP mechanism transforms the Sliced Wasserstein distance into another distance, that we call the Smoothed Sliced Wasserstein Distance. This new differentially private distribution distance can be plugged into generative models and domain adaptation algorithms in a transparent way, and we empirically show that it yields highly competitive performance compared with gradient-based DP approaches from the literature, with almost no loss in accuracy for the domain adaptation problems that we consider.
△ Less
Submitted 5 July, 2021;
originally announced July 2021.
-
Heterogeneous Wasserstein Discrepancy for Incomparable Distributions
Authors:
Mokhtar Z. Alaya,
Gilles Gasso,
Maxime Berar,
Alain Rakotomamonjy
Abstract:
Optimal Transport (OT) metrics allow for defining discrepancies between two probability measures. Wasserstein distance is for longer the celebrated OT-distance frequently-used in the literature, which seeks probability distributions to be supported on the $\textit{same}$ metric space. Because of its high computational complexity, several approximate Wasserstein distances have been proposed based o…
▽ More
Optimal Transport (OT) metrics allow for defining discrepancies between two probability measures. Wasserstein distance is for longer the celebrated OT-distance frequently-used in the literature, which seeks probability distributions to be supported on the $\textit{same}$ metric space. Because of its high computational complexity, several approximate Wasserstein distances have been proposed based on entropy regularization or on slicing, and one-dimensional Wassserstein computation. In this paper, we propose a novel extension of Wasserstein distance to compare two incomparable distributions, that hinges on the idea of $\textit{distributional slicing}$, embeddings, and on computing the closed-form Wassertein distance between the sliced distributions. We provide a theoretical analysis of this new divergence, called $\textit{heterogeneous Wasserstein discrepancy (HWD)}$, and we show that it preserves several interesting properties including rotation-invariance. We show that the embeddings involved in HWD can be efficiently learned. Finally, we provide a large set of experiments illustrating the behavior of HWD as a divergence in the context of generative modeling and in query framework.
△ Less
Submitted 12 October, 2021; v1 submitted 4 June, 2021;
originally announced June 2021.
-
Partial Trace Regression and Low-Rank Kraus Decomposition
Authors:
Hachem Kadri,
Stéphane Ayache,
Riikka Huusari,
Alain Rakotomamonjy,
Liva Ralaivola
Abstract:
The trace regression model, a direct extension of the well-studied linear regression model, allows one to map matrices to real-valued outputs. We here introduce an even more general model, namely the partial-trace regression model, a family of linear mappings from matrix-valued inputs to matrix-valued outputs; this model subsumes the trace regression model and thus the linear regression model. Bor…
▽ More
The trace regression model, a direct extension of the well-studied linear regression model, allows one to map matrices to real-valued outputs. We here introduce an even more general model, namely the partial-trace regression model, a family of linear mappings from matrix-valued inputs to matrix-valued outputs; this model subsumes the trace regression model and thus the linear regression model. Borrowing tools from quantum information theory, where partial trace operators have been extensively studied, we propose a framework for learning partial trace regression models from data by taking advantage of the so-called low-rank Kraus representation of completely positive maps. We show the relevance of our framework with synthetic and real-world experiments conducted for both i) matrix-to-matrix regression and ii) positive semidefinite matrix completion, two tasks which can be formulated as partial trace regression problems.
△ Less
Submitted 25 August, 2020; v1 submitted 2 July, 2020;
originally announced July 2020.
-
Provably Convergent Working Set Algorithm for Non-Convex Regularized Regression
Authors:
Alain Rakotomamonjy,
Rémi Flamary,
Gilles Gasso,
Joseph Salmon
Abstract:
Owing to their statistical properties, non-convex sparse regularizers have attracted much interest for estimating a sparse linear model from high dimensional data. Given that the solution is sparse, for accelerating convergence, a working set strategy addresses the optimization problem through an iterative algorithm by incre-menting the number of variables to optimize until the identification of t…
▽ More
Owing to their statistical properties, non-convex sparse regularizers have attracted much interest for estimating a sparse linear model from high dimensional data. Given that the solution is sparse, for accelerating convergence, a working set strategy addresses the optimization problem through an iterative algorithm by incre-menting the number of variables to optimize until the identification of the solution support. While those methods have been well-studied and theoretically supported for convex regularizers, this paper proposes a working set algorithm for non-convex sparse regularizers with convergence guarantees. The algorithm, named FireWorks, is based on a non-convex reformulation of a recent primal-dual approach and leverages on the geometry of the residuals. Our theoretical guarantees derive from a lower bound of the objective function decrease between two inner solver iterations and shows the convergence to a stationary point of the full problem. More importantly, we also show that convergence is preserved even when the inner solver is inexact, under sufficient decay of the error across iterations. Our experimental results demonstrate high computational gain when using our working set strategy compared to the full problem solver for both block-coordinate descent or a proximal gradient solver.
△ Less
Submitted 20 October, 2021; v1 submitted 24 June, 2020;
originally announced June 2020.
-
Multi-source Domain Adaptation via Weighted Joint Distributions Optimal Transport
Authors:
Rosanna Turrisi,
Rémi Flamary,
Alain Rakotomamonjy,
Massimiliano Pontil
Abstract:
The problem of domain adaptation on an unlabeled target dataset using knowledge from multiple labelled source datasets is becoming increasingly important. A key challenge is to design an approach that overcomes the covariate and target shift both among the sources, and between the source and target domains. In this paper, we address this problem from a new perspective: instead of looking for a lat…
▽ More
The problem of domain adaptation on an unlabeled target dataset using knowledge from multiple labelled source datasets is becoming increasingly important. A key challenge is to design an approach that overcomes the covariate and target shift both among the sources, and between the source and target domains. In this paper, we address this problem from a new perspective: instead of looking for a latent representation invariant between source and target domains, we exploit the diversity of source distributions by tuning their weights to the target task at hand. Our method, named Weighted Joint Distribution Optimal Transport (WJDOT), aims at finding simultaneously an Optimal Transport-based alignment between the source and target distributions and a re-weighting of the sources distributions. We discuss the theoretical aspects of the method and propose a conceptually simple algorithm. Numerical experiments indicate that the proposed method achieves state-of-the-art performance on simulated and real-life datasets.
△ Less
Submitted 2 June, 2022; v1 submitted 23 June, 2020;
originally announced June 2020.
-
Optimal Transport for Conditional Domain Matching and Label Shift
Authors:
Alain Rakotomamonjy,
Rémi Flamary,
Gilles Gasso,
Mokhtar Z. Alaya,
Maxime Berar,
Nicolas Courty
Abstract:
We address the problem of unsupervised domain adaptation under the setting of generalized target shift (joint class-conditional and label shifts). For this framework, we theoretically show that, for good generalization, it is necessary to learn a latent representation in which both marginals and class-conditional distributions are aligned across domains. For this sake, we propose a learning proble…
▽ More
We address the problem of unsupervised domain adaptation under the setting of generalized target shift (joint class-conditional and label shifts). For this framework, we theoretically show that, for good generalization, it is necessary to learn a latent representation in which both marginals and class-conditional distributions are aligned across domains. For this sake, we propose a learning problem that minimizes importance weighted loss in the source domain and a Wasserstein distance between weighted marginals. For a proper weighting, we provide an estimator of target label proportion by blending mixture estimation and optimal matching by optimal transport. This estimation comes with theoretical guarantees of correctness under mild assumptions. Our experimental results show that our method performs better on average than competitors across a range domain adaptation problems including \emph{digits},\emph{VisDA} and \emph{Office}. Code for this paper is available at \url{https://github.com/arakotom/mars_domain_adaptation}.
△ Less
Submitted 19 October, 2021; v1 submitted 15 June, 2020;
originally announced June 2020.
-
Theoretical Guarantees for Bridging Metric Measure Embedding and Optimal Transport
Authors:
Mokhtar Z. Alaya,
Maxime Bérar,
Gilles Gasso,
Alain Rakotomamonjy
Abstract:
We propose a novel approach for comparing distributions whose supports do not necessarily lie on the same metric space. Unlike Gromov-Wasserstein (GW) distance which compares pairwise distances of elements from each distribution, we consider a method allowing to embed the metric measure spaces in a common Euclidean space and compute an optimal transport (OT) on the embedded distributions. This lea…
▽ More
We propose a novel approach for comparing distributions whose supports do not necessarily lie on the same metric space. Unlike Gromov-Wasserstein (GW) distance which compares pairwise distances of elements from each distribution, we consider a method allowing to embed the metric measure spaces in a common Euclidean space and compute an optimal transport (OT) on the embedded distributions. This leads to what we call a sub-embedding robust Wasserstein (SERW) distance. Under some conditions, SERW is a distance that considers an OT distance of the (low-distorted) embedded distributions using a common metric. In addition to this novel proposal that generalizes several recent OT works, our contributions stand on several theoretical analyses: (i) we characterize the embedding spaces to define SERW distance for distribution alignment; (ii) we prove that SERW mimics almost the same properties of GW distance, and we give a cost relation between GW and SERW. The paper also provides some numerical illustrations of how SERW behaves on matching problems.
△ Less
Submitted 22 April, 2021; v1 submitted 19 February, 2020;
originally announced February 2020.
-
Screening Sinkhorn Algorithm for Regularized Optimal Transport
Authors:
Mokhtar Z. Alaya,
Maxime Bérar,
Gilles Gasso,
Alain Rakotomamonjy
Abstract:
We introduce in this paper a novel strategy for efficiently approximating the Sinkhorn distance between two discrete measures. After identifying neglectable components of the dual solution of the regularized Sinkhorn problem, we propose to screen those components by directly setting them at that value before entering the Sinkhorn problem. This allows us to solve a smaller Sinkhorn problem while en…
▽ More
We introduce in this paper a novel strategy for efficiently approximating the Sinkhorn distance between two discrete measures. After identifying neglectable components of the dual solution of the regularized Sinkhorn problem, we propose to screen those components by directly setting them at that value before entering the Sinkhorn problem. This allows us to solve a smaller Sinkhorn problem while ensuring approximation with provable guarantees. More formally, the approach is based on a new formulation of dual of Sinkhorn divergence problem and on the KKT optimality conditions of this problem, which enable identification of dual components to be screened. This new analysis leads to the Screenkhorn algorithm. We illustrate the efficiency of Screenkhorn on complex tasks such as dimensionality reduction and domain adaptation involving regularized optimal transport.
△ Less
Submitted 18 January, 2020; v1 submitted 20 June, 2019;
originally announced June 2019.
-
Screening Rules for Lasso with Non-Convex Sparse Regularizers
Authors:
Alain Rakotomamonjy,
Gilles Gasso,
Joseph Salmon
Abstract:
Leveraging on the convexity of the Lasso problem , screening rules help in accelerating solvers by discarding irrelevant variables, during the optimization process. However, because they provide better theoretical guarantees in identifying relevant variables, several non-convex regularizers for the Lasso have been proposed in the literature. This work is the first that introduces a screening rule…
▽ More
Leveraging on the convexity of the Lasso problem , screening rules help in accelerating solvers by discarding irrelevant variables, during the optimization process. However, because they provide better theoretical guarantees in identifying relevant variables, several non-convex regularizers for the Lasso have been proposed in the literature. This work is the first that introduces a screening rule strategy into a non-convex Lasso solver. The approach we propose is based on a iterative majorization-minimization (MM) strategy that includes a screening rule in the inner solver and a condition for propagating screened variables between iterations of MM. In addition to improve efficiency of solvers, we also provide guarantees that the inner solver is able to identify the zeros components of its critical point in finite time. Our experimental analysis illustrates the significant computational gain brought by the new screening rule compared to classical coordinate-descent or proximal gradient descent methods.
△ Less
Submitted 19 February, 2019; v1 submitted 16 February, 2019;
originally announced February 2019.
-
Distance Measure Machines
Authors:
Alain Rakotomamonjy,
Abraham Traoré,
Maxime Berar,
Rémi Flamary,
Nicolas Courty
Abstract:
This paper presents a distance-based discriminative framework for learning with probability distributions. Instead of using kernel mean embeddings or generalized radial basis kernels, we introduce embeddings based on dissimilarity of distributions to some reference distributions denoted as templates. Our framework extends the theory of similarity of Balcan et al. (2008) to the populatio…
▽ More
This paper presents a distance-based discriminative framework for learning with probability distributions. Instead of using kernel mean embeddings or generalized radial basis kernels, we introduce embeddings based on dissimilarity of distributions to some reference distributions denoted as templates. Our framework extends the theory of similarity of Balcan et al. (2008) to the population distribution case and we show that, for some learning problems, some dissimilarity on distribution achieves low-error linear decision functions with high probability. Our key result is to prove that the theory also holds for empirical distributions. Algorithmically, the proposed approach consists in computing a mapping based on pairwise dissimilarity where learning a linear decision function is amenable. Our experimental results show that the Wasserstein distance embedding performs better than kernel mean embeddings and computing Wasserstein distance is far more tractable than estimating pairwise Kullback-Leibler divergence of empirical distributions.
△ Less
Submitted 15 November, 2018; v1 submitted 1 March, 2018;
originally announced March 2018.
-
Concave losses for robust dictionary learning
Authors:
Rafael Will M de Araujo,
Roberto Hirata,
Alain Rakotomamonjy
Abstract:
Traditional dictionary learning methods are based on quadratic convex loss function and thus are sensitive to outliers. In this paper, we propose a generic framework for robust dictionary learning based on concave losses. We provide results on composition of concave functions, notably regarding super-gradient computations, that are key for developing generic dictionary learning algorithms applicab…
▽ More
Traditional dictionary learning methods are based on quadratic convex loss function and thus are sensitive to outliers. In this paper, we propose a generic framework for robust dictionary learning based on concave losses. We provide results on composition of concave functions, notably regarding super-gradient computations, that are key for developing generic dictionary learning algorithms applicable to smooth and non-smooth losses. In order to improve identification of outliers, we introduce an initialization heuristic based on undercomplete dictionary learning. Experimental results using synthetic and real data demonstrate that our method is able to better detect outliers, is capable of generating better dictionaries, outperforming state-of-the-art methods such as K-SVD and LC-KSVD.
△ Less
Submitted 2 November, 2017;
originally announced November 2017.
-
Joint Distribution Optimal Transportation for Domain Adaptation
Authors:
Nicolas Courty,
Rémi Flamary,
Amaury Habrard,
Alain Rakotomamonjy
Abstract:
This paper deals with the unsupervised domain adaptation problem, where one wants to estimate a prediction function $f$ in a given target domain without any labeled sample by exploiting the knowledge available from a source domain where labels are known. Our work makes the following assumption: there exists a non-linear transformation between the joint feature/label space distributions of the two…
▽ More
This paper deals with the unsupervised domain adaptation problem, where one wants to estimate a prediction function $f$ in a given target domain without any labeled sample by exploiting the knowledge available from a source domain where labels are known. Our work makes the following assumption: there exists a non-linear transformation between the joint feature/label space distributions of the two domain $\mathcal{P}_s$ and $\mathcal{P}_t$. We propose a solution of this problem with optimal transport, that allows to recover an estimated target $\mathcal{P}^f_t=(X,f(X))$ by optimizing simultaneously the optimal coupling and $f$. We show that our method corresponds to the minimization of a bound on the target error, and provide an efficient algorithmic solution, for which convergence is proved. The versatility of our approach, both in terms of class of hypothesis or loss functions is demonstrated with real world classification and regression problems, for which we reach or surpass state-of-the-art results.
△ Less
Submitted 22 October, 2017; v1 submitted 24 May, 2017;
originally announced May 2017.
-
Wasserstein Discriminant Analysis
Authors:
Rémi Flamary,
Marco Cuturi,
Nicolas Courty,
Alain Rakotomamonjy
Abstract:
Wasserstein Discriminant Analysis (WDA) is a new supervised method that can improve classification of high-dimensional data by computing a suitable linear map onto a lower dimensional subspace. Following the blueprint of classical Linear Discriminant Analysis (LDA), WDA selects the projection matrix that maximizes the ratio of two quantities: the dispersion of projected points coming from differen…
▽ More
Wasserstein Discriminant Analysis (WDA) is a new supervised method that can improve classification of high-dimensional data by computing a suitable linear map onto a lower dimensional subspace. Following the blueprint of classical Linear Discriminant Analysis (LDA), WDA selects the projection matrix that maximizes the ratio of two quantities: the dispersion of projected points coming from different classes, divided by the dispersion of projected points coming from the same class. To quantify dispersion, WDA uses regularized Wasserstein distances, rather than cross-variance measures which have been usually considered, notably in LDA. Thanks to the the underlying principles of optimal transport, WDA is able to capture both global (at distribution scale) and local (at samples scale) interactions between classes. Regularized Wasserstein distances can be computed using the Sinkhorn matrix scaling algorithm; We show that the optimization of WDA can be tackled using automatic differentiation of Sinkhorn iterations. Numerical experiments show promising results both in terms of prediction and visualization on toy examples and real life datasets such as MNIST and on deep features obtained from a subset of the Caltech dataset.
△ Less
Submitted 23 May, 2018; v1 submitted 29 August, 2016;
originally announced August 2016.
-
Operator-valued Kernels for Learning from Functional Response Data
Authors:
Hachem Kadri,
Emmanuel Duflos,
Philippe Preux,
Stéphane Canu,
Alain Rakotomamonjy,
Julien Audiffren
Abstract:
In this paper we consider the problems of supervised classification and regression in the case where attributes and labels are functions: a data is represented by a set of functions, and the label is also a function. We focus on the use of reproducing kernel Hilbert space theory to learn from such functional data. Basic concepts and properties of kernel-based learning are extended to include the e…
▽ More
In this paper we consider the problems of supervised classification and regression in the case where attributes and labels are functions: a data is represented by a set of functions, and the label is also a function. We focus on the use of reproducing kernel Hilbert space theory to learn from such functional data. Basic concepts and properties of kernel-based learning are extended to include the estimation of function-valued functions. In this setting, the representer theorem is restated, a set of rigorously defined infinite-dimensional operator-valued kernels that can be valuably applied when the data are functions is described, and a learning algorithm for nonlinear functional data analysis is introduced. The methodology is illustrated through speech and audio signal processing experiments.
△ Less
Submitted 2 November, 2016; v1 submitted 28 October, 2015;
originally announced October 2015.
-
Generalized conditional gradient: analysis of convergence and applications
Authors:
Alain Rakotomamonjy,
Rémi Flamary,
Nicolas Courty
Abstract:
The objectives of this technical report is to provide additional results on the generalized conditional gradient methods introduced by Bredies et al. [BLM05]. Indeed , when the objective function is smooth, we provide a novel certificate of optimality and we show that the algorithm has a linear convergence rate. Applications of this algorithm are also discussed.
The objectives of this technical report is to provide additional results on the generalized conditional gradient methods introduced by Bredies et al. [BLM05]. Indeed , when the objective function is smooth, we provide a novel certificate of optimality and we show that the algorithm has a linear convergence rate. Applications of this algorithm are also discussed.
△ Less
Submitted 22 October, 2015;
originally announced October 2015.
-
DC Proximal Newton for Non-Convex Optimization Problems
Authors:
Alain Rakotomamonjy,
Remi Flamary,
Gilles Gasso
Abstract:
We introduce a novel algorithm for solving learning problems where both the loss function and the regularizer are non-convex but belong to the class of difference of convex (DC) functions. Our contribution is a new general purpose proximal Newton algorithm that is able to deal with such a situation. The algorithm consists in obtaining a descent direction from an approximation of the loss function…
▽ More
We introduce a novel algorithm for solving learning problems where both the loss function and the regularizer are non-convex but belong to the class of difference of convex (DC) functions. Our contribution is a new general purpose proximal Newton algorithm that is able to deal with such a situation. The algorithm consists in obtaining a descent direction from an approximation of the loss function and then in performing a line search to ensure sufficient descent. A theoretical analysis is provided showing that the iterates of the proposed algorithm {admit} as limit points stationary points of the DC objective function. Numerical experiments show that our approach is more efficient than current state of the art for a problem with a convex loss functions and non-convex regularizer. We have also illustrated the benefit of our algorithm in high-dimensional transductive learning problem where both loss function and regularizers are non-convex.
△ Less
Submitted 2 July, 2015;
originally announced July 2015.
-
Functional Regularized Least Squares Classi cation with Operator-valued Kernels
Authors:
Hachem Kadri,
Asma Rabaoui,
Philippe Preux,
Emmanuel Duflos,
Alain Rakotomamonjy
Abstract:
Although operator-valued kernels have recently received increasing interest in various machine learning and functional data analysis problems such as multi-task learning or functional regression, little attention has been paid to the understanding of their associated feature spaces. In this paper, we explore the potential of adopting an operator-valued kernel feature space perspective for the anal…
▽ More
Although operator-valued kernels have recently received increasing interest in various machine learning and functional data analysis problems such as multi-task learning or functional regression, little attention has been paid to the understanding of their associated feature spaces. In this paper, we explore the potential of adopting an operator-valued kernel feature space perspective for the analysis of functional data. We then extend the Regularized Least Squares Classification (RLSC) algorithm to cover situations where there are multiple functions per observation. Experiments on a sound recognition problem show that the proposed method outperforms the classical RLSC algorithm.
△ Less
Submitted 12 January, 2013;
originally announced January 2013.
-
Adaptive Canonical Correlation Analysis Based On Matrix Manifolds
Authors:
Florian Yger,
Maxime Berar,
Gilles Gasso,
Alain Rakotomamonjy
Abstract:
In this paper, we formulate the Canonical Correlation Analysis (CCA) problem on matrix manifolds. This framework provides a natural way for dealing with matrix constraints and tools for building efficient algorithms even in an adaptive setting. Finally, an adaptive CCA algorithm is proposed and applied to a change detection problem in EEG signals.
In this paper, we formulate the Canonical Correlation Analysis (CCA) problem on matrix manifolds. This framework provides a natural way for dealing with matrix constraints and tools for building efficient algorithms even in an adaptive setting. Finally, an adaptive CCA algorithm is proposed and applied to a change detection problem in EEG signals.
△ Less
Submitted 27 June, 2012;
originally announced June 2012.
-
Sparse Support Vector Infinite Push
Authors:
Alain Rakotomamonjy
Abstract:
In this paper, we address the problem of embedded feature selection for ranking on top of the list problems. We pose this problem as a regularized empirical risk minimization with $p$-norm push loss function ($p=\infty$) and sparsity inducing regularizers. We leverage the issues related to this challenging optimization problem by considering an alternating direction method of multipliers algorithm…
▽ More
In this paper, we address the problem of embedded feature selection for ranking on top of the list problems. We pose this problem as a regularized empirical risk minimization with $p$-norm push loss function ($p=\infty$) and sparsity inducing regularizers. We leverage the issues related to this challenging optimization problem by considering an alternating direction method of multipliers algorithm which is built upon proximal operators of the loss function and the regularizer. Our main technical contribution is thus to provide a numerical scheme for computing the infinite push loss function proximal operator. Experimental results on toy, DNA microarray and BCI problems show how our novel algorithm compares favorably to competitors for ranking on top while using fewer variables in the scoring function.
△ Less
Submitted 27 June, 2012;
originally announced June 2012.
-
Multiple Operator-valued Kernel Learning
Authors:
Hachem Kadri,
Alain Rakotomamonjy,
Francis Bach,
Philippe Preux
Abstract:
Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the…
▽ More
Positive definite operator-valued kernels generalize the well-known notion of reproducing kernels, and are naturally adapted to multi-output learning situations. This paper addresses the problem of learning a finite linear combination of infinite-dimensional operator-valued kernels which are suitable for extending functional data analysis methods to nonlinear contexts. We study this problem in the case of kernel ridge regression for functional responses with an lr-norm constraint on the combination coefficients. The resulting optimization problem is more involved than those of multiple scalar-valued kernel learning since operator-valued kernels pose more technical and theoretical issues. We propose a multiple operator-valued kernel learning algorithm based on solving a system of linear operator equations by using a block coordinatedescent procedure. We experimentally validate our approach on a functional regression task in the context of finger movement prediction in brain-computer interfaces.
△ Less
Submitted 14 June, 2012; v1 submitted 7 March, 2012;
originally announced March 2012.
-
Functional learning through kernels
Authors:
Stephane Canu,
Xavier Mary,
Alain Rakotomamonjy
Abstract:
This paper reviews the functional aspects of statistical learning theory. The main point under consideration is the nature of the hypothesis set when no prior information is available but data. Within this framework we first discuss about the hypothesis set: it is a vectorial space, it is a set of pointwise defined functions, and the evaluation functional on this set is a continuous mapping. Bas…
▽ More
This paper reviews the functional aspects of statistical learning theory. The main point under consideration is the nature of the hypothesis set when no prior information is available but data. Within this framework we first discuss about the hypothesis set: it is a vectorial space, it is a set of pointwise defined functions, and the evaluation functional on this set is a continuous mapping. Based on these principles an original theory is developed generalizing the notion of reproduction kernel Hilbert space to non hilbertian sets. Then it is shown that the hypothesis set of any learning machine has to be a generalized reproducing set. Therefore, thanks to a general "representer theorem", the solution of the learning problem is still a linear combination of a kernel. Furthermore, a way to design these kernels is given. To illustrate this framework some examples of such reproducing sets and kernels are given.
△ Less
Submitted 6 October, 2009;
originally announced October 2009.