-
Federated Wasserstein Distance
Authors:
Alain Rakotomamonjy,
Kimia Nadjahi,
Liva Ralaivola
Abstract:
We introduce a principled way of computing the Wasserstein distance between two distributions in a federated manner. Namely, we show how to estimate the Wasserstein distance between two samples stored and kept on different devices/clients whilst a central entity/server orchestrates the computations (again, without having access to the samples). To achieve this feat, we take advantage of the geometric properties of the Wasserstein distance -- in particular, the triangle inequality -- and those of the associated geodesics: our algorithm, FedWad (for Federated Wasserstein Distance), iteratively approximates the Wasserstein distance by manipulating and exchanging distributions from the space of geodesics in lieu of the input samples. In addition to establishing the convergence properties of FedWad, we provide empirical results on federated coresets and federated optimal transport dataset distance, which we respectively exploit for building a novel federated model and for boosting the performance of popular federated learning algorithms.
Submitted 3 October, 2023;
originally announced October 2023.
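The geometric fact this abstract relies on, that the triangle inequality becomes an equality for a point taken on a Wasserstein geodesic, can be checked in one dimension, where the distance between equal-size samples has a closed form based on sorting. The sketch below is not FedWad; the helper names `wasserstein_1d` and `geodesic_point_1d` are hypothetical and the 1D, equal-size setting is an assumption made for illustration.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """p-Wasserstein distance between two equal-size 1D empirical samples."""
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    return float(np.mean(np.abs(x_sorted - y_sorted) ** p) ** (1.0 / p))

def geodesic_point_1d(x, y, t):
    """Sample representing the point at time t on the geodesic from x to y.

    In 1D the optimal coupling matches sorted samples, so the geodesic is
    a linear interpolation between the two sorted samples.
    """
    return (1.0 - t) * np.sort(x) + t * np.sort(y)

rng = np.random.default_rng(0)
mu = rng.normal(0.0, 1.0, size=200)
nu = rng.normal(3.0, 0.5, size=200)
xi = geodesic_point_1d(mu, nu, t=0.3)

direct = wasserstein_1d(mu, nu)
via_xi = wasserstein_1d(mu, xi) + wasserstein_1d(xi, nu)
# On the geodesic the two-leg path has the same length as the direct one.
print(direct, via_xi)
```

For a point off the geodesic, the two-leg sum can only be larger than the direct distance, which is what makes the geodesic a natural object to exchange in place of the raw samples.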
-
Personalised Federated Learning On Heterogeneous Feature Spaces
Authors:
Alain Rakotomamonjy,
Maxime Vono,
Hamlet Jesse Medina Ruiz,
Liva Ralaivola
Abstract:
Most personalised federated learning (FL) approaches assume that the raw data of all clients are defined in a common subspace, i.e., all clients store their data according to the same schema. For real-world applications, this assumption is restrictive, as clients, having their own systems to collect and then store data, may use heterogeneous data representations. We aim at filling this gap. To this end, we propose a general framework coined FLIC that maps clients' data onto a common feature space via local embedding functions. The common feature space is learnt in a federated manner using Wasserstein barycenters while the local embedding functions are trained on each client via distribution alignment. We integrate this distribution alignment mechanism into a federated learning approach and provide the algorithmics of FLIC. We compare its performance against FL benchmarks involving heterogeneous input feature spaces. In addition, we provide theoretical insights supporting the relevance of our methodology.
Submitted 26 January, 2023;
originally announced January 2023.
-
Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances
Authors:
Ruben Ohana,
Kimia Nadjahi,
Alain Rakotomamonjy,
Liva Ralaivola
Abstract:
The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties -- or, more accurately, its generalization properties -- with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and a central observation that SW may be interpreted as an average risk, the quantity PAC-Bayesian bounds have been designed to characterize. We provide three types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. SW defined with respect to arbitrary distributions of slices (among which data-dependent distributions), ii) a principled procedure to learn the distribution of slices that yields maximally discriminative SW, by optimizing our theoretical bounds, and iii) empirical illustrations of our theoretical findings.
Submitted 31 May, 2023; v1 submitted 7 June, 2022;
originally announced June 2022.
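To make the role of the slice distribution concrete, here is a minimal Monte Carlo sketch of SW between two equal-size samples, where the slices are passed in explicitly so that any distribution of directions can be plugged in. The function names are hypothetical, and the hand-picked "aligned" slice only illustrates why non-uniform slices can be more discriminative; it is not the learning procedure of the paper.

```python
import numpy as np

def sliced_wasserstein(x, y, directions, p=2):
    """Monte Carlo SW_p between equal-size samples x, y of shape (n, d).

    `directions` has shape (k, d); each row is a unit-norm slice.
    """
    proj_x = np.sort(x @ directions.T, axis=0)  # (n, k), each column sorted
    proj_y = np.sort(y @ directions.T, axis=0)
    per_slice = np.mean(np.abs(proj_x - proj_y) ** p, axis=0)  # W_p^p per slice
    return float(np.mean(per_slice) ** (1.0 / p))

def uniform_directions(k, d, rng):
    """Draw k slices uniformly on the unit sphere of R^d."""
    g = rng.normal(size=(k, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 5))
y = rng.normal(size=(500, 5)) + 2.0          # same shape, shifted mean

theta = uniform_directions(100, 5, rng)
sw_same = sliced_wasserstein(x, x, theta)
sw_shift = sliced_wasserstein(x, y, theta)

# A single slice aligned with the mean shift (a hand-picked, data-dependent choice).
shift = np.full((1, 5), 2.0)
aligned = shift / np.linalg.norm(shift)
sw_aligned = sliced_wasserstein(x, y, aligned)
print(sw_same, sw_shift, sw_aligned)
```

In this toy setting the aligned slice separates the two samples more than the uniform average does, which is the intuition behind learning the slice distribution rather than fixing it to the uniform measure.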
-
Differentially Private Sliced Wasserstein Distance
Authors:
Alain Rakotomamonjy,
Liva Ralaivola
Abstract:
Developing machine learning methods that are privacy preserving is today a central topic of research, with huge practical impacts. Among the numerous ways to address privacy-preserving learning, we here take the perspective of computing the divergences between distributions under the Differential Privacy (DP) framework -- being able to compute divergences between distributions is pivotal for many machine learning problems, such as learning generative models or domain adaptation problems. Instead of resorting to the popular gradient-based sanitization method for DP, we tackle the problem at its roots by focusing on the Sliced Wasserstein Distance and seamlessly making it differentially private. Our main contribution is as follows: we analyze the property of adding a Gaussian perturbation to the intrinsic randomized mechanism of the Sliced Wasserstein Distance, and we establish the sensitivity of the resulting differentially private mechanism. One of our important findings is that this DP mechanism transforms the Sliced Wasserstein distance into another distance, which we call the Smoothed Sliced Wasserstein Distance. This new differentially private distribution distance can be plugged into generative models and domain adaptation algorithms in a transparent way, and we empirically show that it yields highly competitive performance compared with gradient-based DP approaches from the literature, with almost no loss in accuracy for the domain adaptation problems that we consider.
Submitted 5 July, 2021;
originally announced July 2021.
-
Photonic Differential Privacy with Direct Feedback Alignment
Authors:
Ruben Ohana,
Hamlet J. Medina Ruiz,
Julien Launay,
Alessandro Cappelli,
Iacopo Poli,
Liva Ralaivola,
Alain Rakotomamonjy
Abstract:
Optical Processing Units (OPUs) -- low-power photonic chips dedicated to large scale random projections -- have been used in previous work to train deep neural networks using Direct Feedback Alignment (DFA), an effective alternative to backpropagation. Here, we demonstrate how to leverage the intrinsic noise of optical random projections to build a differentially private DFA mechanism, making OPUs a solution of choice to provide a private-by-design training. We provide a theoretical analysis of our adaptive privacy mechanism, carefully measuring how the noise of optical random projections propagates in the process and gives rise to provable Differential Privacy. Finally, we conduct experiments demonstrating the ability of our learning procedure to achieve solid end-task performance.
Submitted 25 March, 2022; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Partial Trace Regression and Low-Rank Kraus Decomposition
Authors:
Hachem Kadri,
Stéphane Ayache,
Riikka Huusari,
Alain Rakotomamonjy,
Liva Ralaivola
Abstract:
The trace regression model, a direct extension of the well-studied linear regression model, allows one to map matrices to real-valued outputs. We here introduce an even more general model, namely the partial-trace regression model, a family of linear mappings from matrix-valued inputs to matrix-valued outputs; this model subsumes the trace regression model and thus the linear regression model. Borrowing tools from quantum information theory, where partial trace operators have been extensively studied, we propose a framework for learning partial trace regression models from data by taking advantage of the so-called low-rank Kraus representation of completely positive maps. We show the relevance of our framework with synthetic and real-world experiments conducted for both i) matrix-to-matrix regression and ii) positive semidefinite matrix completion, two tasks which can be formulated as partial trace regression problems.
Submitted 25 August, 2020; v1 submitted 2 July, 2020;
originally announced July 2020.
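The two objects named in this abstract, the partial trace and a map written in Kraus form, have short textbook definitions that can be sketched numerically. The code below is an illustration under stated assumptions (real matrices, hypothetical helper names `partial_trace` and `kraus_map`); it does not reproduce the learning framework of the paper.

```python
import numpy as np

def partial_trace(m, p, q):
    """Trace out the second (size-q) factor of a (p*q, p*q) matrix."""
    blocks = m.reshape(p, q, p, q)           # indices (i, j, k, l)
    return np.einsum("ijkj->ik", blocks)     # sum over the shared index j

def kraus_map(x, kraus_ops):
    """Completely positive map x -> sum_j A_j x A_j^T (real-valued case).

    The number of operators in `kraus_ops` plays the role of the Kraus rank.
    """
    return sum(a @ x @ a.T for a in kraus_ops)

rng = np.random.default_rng(0)

# Sanity check: the partial trace of a Kronecker product A (x) B is A * tr(B).
a = rng.normal(size=(2, 2))
b = rng.normal(size=(3, 3))
reduced = partial_trace(np.kron(a, b), p=2, q=3)

# A low-rank Kraus map sends positive semidefinite inputs to PSD outputs.
ops = [rng.normal(size=(2, 4)) for _ in range(3)]
g = rng.normal(size=(4, 4))
x_psd = g @ g.T
out = kraus_map(x_psd, ops)
print(reduced.shape, out.shape)
```

Keeping the list of Kraus operators short is what makes the representation low-rank: the map from 4x4 inputs to 2x2 outputs above is parameterized by only three 2x4 matrices.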
-
Quantum Bandits
Authors:
Balthazar Casalé,
Giuseppe Di Molfetta,
Hachem Kadri,
Liva Ralaivola
Abstract:
We consider the quantum version of the bandit problem known as best arm identification (BAI). We first propose a quantum modeling of the BAI problem, which assumes that both the learning agent and the environment are quantum; we then propose an algorithm based on quantum amplitude amplification to solve BAI. We formally analyze the behavior of the algorithm on all instances of the problem and we show, in particular, that it is able to get the optimal solution quadratically faster than what is known to hold in the classical case.
Submitted 22 September, 2020; v1 submitted 15 February, 2020;
originally announced February 2020.
-
QuicK-means: Acceleration of K-means by learning a fast transform
Authors:
Luc Giffon,
Valentin Emiya,
Liva Ralaivola,
Hachem Kadri
Abstract:
K-means -- and the celebrated Lloyd algorithm -- is more than the clustering method it was originally designed to be. It has indeed proven pivotal to help increase the speed of many machine learning and data analysis techniques such as indexing, nearest-neighbor search and prediction, and data compression; its beneficial use has been shown to carry over to the acceleration of kernel machines (when using the Nyström method). Here, we propose a fast extension of K-means, dubbed QuicK-means, that rests on the idea of expressing the matrix of the $K$ centroids as a product of sparse matrices, a feat made possible by recent results devoted to finding approximations of matrices as a product of sparse factors. Using such a decomposition squashes the complexity of the matrix-vector product between the factorized $K \times D$ centroid matrix $\mathbf{U}$ and any vector from $\mathcal{O}(K D)$ to $\mathcal{O}(A \log A+B)$, with $A=\min (K, D)$ and $B=\max (K, D)$, where $D$ is the dimension of the training data. This drastic computational saving has a direct impact on the assignment process of a point to a cluster, meaning that it is not only tangible at prediction time, but also at training time, provided the factorization procedure is performed during Lloyd's algorithm. We precisely show that resorting to a factorization step at each iteration does not impair the convergence of the optimization scheme and that, depending on the context, it may entail a reduction of the training time. Finally, we provide discussions and numerical simulations that show the versatility of our computationally-efficient QuicK-means algorithm.
Submitted 23 August, 2019;
originally announced August 2019.
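To get a feel for the gap between $\mathcal{O}(K D)$ and $\mathcal{O}(A \log A + B)$, one can plug in hypothetical sizes. The values $K = D = 1024$ below are an assumption chosen for illustration, the logarithm is taken in base 2, and the constants hidden by the $\mathcal{O}$ notation are ignored, so the ratio is only an order-of-magnitude indication.

```latex
\begin{align*}
K D &= 1024 \times 1024 = 1\,048\,576, \\
A \log_2 A + B &= 1024 \times 10 + 1024 = 11\,264, \\
\frac{K D}{A \log_2 A + B} &\approx 93.
\end{align*}
```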
-
Recovery and convergence rate of the Frank-Wolfe Algorithm for the m-EXACT-SPARSE Problem
Authors:
Farah Cherfaoui,
Valentin Emiya,
Liva Ralaivola,
Sandrine Anthoine
Abstract:
We study the properties of the Frank-Wolfe algorithm to solve the m-EXACT-SPARSE reconstruction problem, where a signal y must be expressed as a sparse linear combination of a predefined set of atoms, called a dictionary. We prove that when the signal is sparse enough with respect to the coherence of the dictionary, the iterative process implemented by the Frank-Wolfe algorithm only recruits atoms from the support of the signal, that is, the smallest set of atoms from the dictionary that allows for a perfect reconstruction of y. We also prove that under this same condition, there exists an iteration beyond which the algorithm converges exponentially.
Submitted 22 May, 2019;
originally announced May 2019.
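The atom-recruiting behaviour described in this abstract can be observed on a toy problem. The sketch below runs a standard Frank-Wolfe iteration with exact line search on a least-squares objective constrained to an l1 ball; this formulation, the radius `beta`, the stopping tolerance, and the orthonormal (zero-coherence) dictionary are assumptions made for illustration rather than the exact setting analysed in the paper.

```python
import numpy as np

def frank_wolfe_l1(dictionary, y, beta, n_iter=50, tol=1e-12):
    """Frank-Wolfe for min 0.5*||y - D x||^2 subject to ||x||_1 <= beta.

    Returns the final iterate and the list of atom indices picked so far.
    """
    n_atoms = dictionary.shape[1]
    x = np.zeros(n_atoms)
    selected = []
    for _ in range(n_iter):
        residual = y - dictionary @ x
        grad = -dictionary.T @ residual
        i = int(np.argmax(np.abs(grad)))      # atom most correlated with residual
        if np.abs(grad[i]) < tol:             # gradient vanished: stop
            break
        s = np.zeros(n_atoms)
        s[i] = -beta * np.sign(grad[i])       # vertex of the l1 ball
        direction = s - x
        d_img = dictionary @ direction
        denom = d_img @ d_img
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        gamma = float(np.clip((residual @ d_img) / denom, 0.0, 1.0))
        selected.append(i)
        x = x + gamma * direction
    return x, selected

D = np.eye(6)                                  # orthonormal dictionary
x_star = np.array([2.0, 0.0, -1.0, 0.0, 0.0, 0.0])
y = D @ x_star
x_hat, selected = frank_wolfe_l1(D, y, beta=6.0)
print(sorted(set(selected)), np.linalg.norm(y - D @ x_hat))
```

In this toy run every selected index belongs to the support {0, 2} of `x_star`, because the gradient is exactly zero on the other coordinates; with a coherent dictionary that property is precisely what requires the sparsity condition stated in the abstract.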
-
Frank-Wolfe Algorithm for the Exact Sparse Problem
Authors:
Farah Cherfaoui,
Valentin Emiya,
Liva Ralaivola,
Sandrine Anthoine
Abstract:
In this paper, we study the properties of the Frank-Wolfe algorithm to solve the Exact Sparse reconstruction problem. We prove that when the dictionary is quasi-incoherent, at each iteration, the Frank-Wolfe algorithm picks up an atom indexed by the support. We also prove that when the dictionary is quasi-incoherent, there exists an iteration beyond which the algorithm converges exponentially fast.
Submitted 18 December, 2018;
originally announced December 2018.
-
Dependency-dependent Bounds for Sums of Dependent Random Variables
Authors:
Christoph H. Lampert,
Liva Ralaivola,
Alexander Zimin
Abstract:
We consider the problem of bounding large deviations for non-i.i.d. random variables that are allowed to have arbitrary dependencies. Previous works typically assumed a specific dependence structure, namely the existence of independent components. Bounds that depend on the degree of dependence between the observations have only been studied in the theory of mixing processes, where variables are time-ordered. Here, we introduce a new way of measuring dependences within an unordered set of variables. We prove concentration inequalities that apply to any set of random variables, but benefit from the presence of weak dependencies. We also discuss applications and extensions of our results to related problems of machine learning and large deviations.
Submitted 4 November, 2018;
originally announced November 2018.
-
Greedy methods, randomization approaches and multi-arm bandit algorithms for efficient sparsity-constrained optimization
Authors:
A Rakotomamonjy,
S Koço,
Liva Ralaivola
Abstract:
Several sparsity-constrained algorithms such as Orthogonal Matching Pursuit or the Frank-Wolfe algorithm with sparsity constraints work by iteratively selecting a novel atom to add to the current non-zero set of variables. This selection step is usually performed by computing the gradient and then by looking for the gradient component with maximal absolute entry. This step can be computationally expensive, especially for large-scale and high-dimensional data. In this work, we aim at accelerating these sparsity-constrained optimization algorithms by exploiting the key observation that, for these algorithms to work, one only needs the coordinate of the gradient's top entry. Hence, we introduce algorithms based on greedy methods and randomization approaches that aim at cheaply estimating the gradient and its top entry. Another of our contributions is to cast the problem of finding the best gradient entry as a best arm identification in a multi-armed bandit problem. Owing to this novel insight, we are able to provide a bandit-based algorithm that directly estimates the top entry in a very efficient way. Theoretical observations stating that the resulting inexact Frank-Wolfe or Orthogonal Matching Pursuit algorithms act, with high probability, similarly to their exact versions are also given. We have carried out several experiments showing that the greedy deterministic and the bandit approaches we propose can achieve an acceleration of an order of magnitude while being as efficient as the exact gradient when used in algorithms such as OMP, Frank-Wolfe or CoSaMP.
Submitted 22 August, 2016; v1 submitted 26 August, 2015;
originally announced August 2015.
-
From Cutting Planes Algorithms to Compression Schemes and Active Learning
Authors:
Liva Ralaivola,
Ugo Louche
Abstract:
Cutting-plane methods are well-studied localization (and optimization) algorithms. We show that they provide a natural framework to perform machine learning ---and not just to solve optimization problems posed by machine learning--- in addition to their intended optimization use. In particular, they allow one to learn sparse classifiers and provide good compression schemes. Moreover, we show that very little effort is required to turn them into effective active learning methods. This last property provides a generic way to design a whole family of active learning algorithms from existing passive methods. We present numerical simulations testifying to the relevance of cutting-plane methods for passive and active learning tasks.
Submitted 12 August, 2015;
originally announced August 2015.
-
Unconfused ultraconservative multiclass algorithms
Authors:
Ugo Louche,
Liva Ralaivola
Abstract:
We tackle the problem of learning linear classifiers from noisy datasets in a multiclass setting. The two-class version of this problem was studied a few years ago where the proposed approaches to combat the noise revolve around a Perceptron learning scheme fed with peculiar examples computed through a weighted average of points from the noisy training set. We propose to build upon these approaches and we introduce a new algorithm called UMA (for Unconfused Multiclass additive Algorithm) which may be seen as a generalization to the multiclass setting of the previous approaches. In order to characterize the noise we use the confusion matrix as a multiclass extension of the classification noise studied in the aforementioned literature. Theoretically well-founded, UMA furthermore displays very good empirical noise robustness, as evidenced by numerical simulations conducted on both synthetic and real data.
Submitted 24 June, 2015;
originally announced June 2015.
-
On Generalizing the C-Bound to the Multiclass and Multi-label Settings
Authors:
Francois Laviolette,
Emilie Morvant,
Liva Ralaivola,
Jean-Francis Roy
Abstract:
The C-bound, introduced in Lacasse et al., gives a tight upper bound on the risk of a binary majority vote classifier. In this work, we present a first step towards extending this work to more complex outputs, by providing generalizations of the C-bound to the multiclass and multi-label settings.
Submitted 13 January, 2015;
originally announced January 2015.
-
Dynamic Screening: Accelerating First-Order Algorithms for the Lasso and Group-Lasso
Authors:
Antoine Bonnefoy,
Valentin Emiya,
Liva Ralaivola,
Rémi Gribonval
Abstract:
Recent computational strategies based on screening tests have been proposed to accelerate algorithms addressing penalized sparse regression problems such as the Lasso. Such approaches build upon the idea that it is worth dedicating some small computational effort to locate inactive atoms and remove them from the dictionary in a preprocessing stage so that the regression algorithm working with a smaller dictionary will then converge faster to the solution of the initial problem. We believe that there is an even more efficient way to screen the dictionary and obtain a greater acceleration: inside each iteration of the regression algorithm, one may take advantage of the algorithm computations to obtain a new screening test for free with increasing screening effects along the iterations. The dictionary is henceforth dynamically screened instead of being screened statically, once and for all, before the first iteration. We formalize this dynamic screening principle in a general algorithmic scheme and apply it by embedding, inside a number of first-order algorithms, existing screening tests, suitably adapted, to solve the Lasso, or new screening tests to solve the Group-Lasso. Computational gains are assessed in a large set of experiments on synthetic data as well as real-world sounds and images. They show both the screening efficiency and the gain in terms of running times.
Submitted 12 December, 2014;
originally announced December 2014.
-
On the Generalization of the C-Bound to Structured Output Ensemble Methods
Authors:
François Laviolette,
Emilie Morvant,
Liva Ralaivola,
Jean-Francis Roy
Abstract:
This paper generalizes an important result from the PAC-Bayesian literature for binary classification to the case of ensemble methods for structured outputs. We prove a generic version of the C-bound, an upper bound over the risk of models expressed as a weighted majority vote that is based on the first and second statistical moments of the vote's margin. This bound may advantageously $(i)$ be applied to more complex outputs, such as multiclass and multilabel ones, and $(ii)$ allow margin relaxations to be considered. These results open the way to developing new ensemble methods for structured output prediction with PAC-Bayesian guarantees.
Submitted 15 June, 2015; v1 submitted 6 August, 2014;
originally announced August 2014.
-
Stationary Mixing Bandits
Authors:
Julien Audiffren,
Liva Ralaivola
Abstract:
We study the bandit problem where arms are associated with stationary phi-mixing processes and where rewards are therefore dependent: the question that arises from this setting is that of recovering some independence by ignoring the value of some rewards. As we shall see, the bandit problem we tackle requires us to address the exploration/exploitation/independence trade-off. To do so, we provide a UCB strategy together with a general regret analysis for the case where the size of the independence blocks (the ignored rewards) is fixed and we go a step beyond by providing an algorithm that is able to compute the size of the independence blocks from the data. Finally, we give an analysis of our bandit problem in the restless case, i.e., in the situation where the time counters for all mixing processes simultaneously evolve.
Submitted 23 June, 2014;
originally announced June 2014.
-
Unconfused Ultraconservative Multiclass Algorithms
Authors:
Ugo Louche,
Liva Ralaivola
Abstract:
We tackle the problem of learning linear classifiers from noisy datasets in a multiclass setting. The two-class version of this problem was studied a few years ago by, e.g. Bylander (1994) and Blum et al. (1996): in these contributions, the proposed approaches to fight the noise revolve around a Perceptron learning scheme fed with peculiar examples computed through a weighted average of points from the noisy training set. We propose to build upon these approaches and we introduce a new algorithm called UMA (for Unconfused Multiclass additive Algorithm) which may be seen as a generalization to the multiclass setting of the previous approaches. In order to characterize the noise we use the confusion matrix as a multiclass extension of the classification noise studied in the aforementioned literature. Theoretically well-founded, UMA furthermore displays very good empirical noise robustness, as evidenced by numerical simulations conducted on both synthetic and real data. Keywords: Multiclass classification, Perceptron, Noisy labels, Confusion Matrix
Submitted 20 March, 2014;
originally announced March 2014.
-
PAC-Bayesian Generalization Bound on Confusion Matrix for Multi-Class Classification
Authors:
Emilie Morvant,
Sokol Koço,
Liva Ralaivola
Abstract:
In this work, we propose a PAC-Bayes bound for the generalization risk of the Gibbs classifier in the multi-class classification framework. The novelty of our work is the critical use of the confusion matrix of a classifier as an error measure; this puts our contribution in the line of work aiming at dealing with performance measures that are richer than mere scalar criteria such as the misclassification rate. Thanks to very recent and beautiful results on matrix concentration inequalities, we derive two bounds showing that the true confusion risk of the Gibbs classifier is upper-bounded by its empirical risk plus a term depending on the number of training examples in each class. To the best of our knowledge, these are the first PAC-Bayes bounds based on confusion matrices.
Submitted 22 October, 2013; v1 submitted 28 February, 2012;
originally announced February 2012.
-
Confusion Matrix Stability Bounds for Multiclass Classification
Authors:
Pierre Machart,
Liva Ralaivola
Abstract:
In this paper, we provide new theoretical results on the generalization properties of learning algorithms for multiclass classification problems. The originality of our work is that we propose to use the confusion matrix of a classifier as a measure of its quality; our contribution is in the line of work which attempts to set up and study the statistical properties of new evaluation measures such as, e.g., ROC curves. In the confusion-based learning framework we propose, we claim that a targeted objective is to minimize the size of the confusion matrix C, measured through its operator norm ||C||. We derive generalization bounds on the (size of the) confusion matrix in an extended framework of uniform stability, adapted to the case of matrix-valued loss. Pivotal to our study is a very recent matrix concentration inequality that generalizes McDiarmid's inequality. As an illustration of the relevance of our theoretical results, we show how two SVM learning procedures can be proved to be confusion-friendly. To the best of our knowledge, the present paper is the first that focuses on the confusion matrix from a theoretical point of view.
Submitted 24 May, 2012; v1 submitted 28 February, 2012;
originally announced February 2012.
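Illustration (not from the paper): the abstract above measures the size of a confusion matrix C by its operator norm ||C||. The sketch below shows one way this quantity could be computed from predictions. The row normalisation by class size and the zeroed diagonal (so that only errors are counted) are assumptions made for this example, and the function names are hypothetical.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Per-class error rates: entry (i, j) is the fraction of class-i
    examples predicted as class j, with the diagonal set to zero.

    Assumption for this sketch: rows are normalised by class size and
    correct predictions are excluded, so a perfect classifier gives C = 0.
    """
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        counts[t, p] += 1
    class_sizes = counts.sum(axis=1, keepdims=True)
    # Divide row-wise, leaving rows of absent classes at zero.
    rates = np.divide(counts, class_sizes,
                      out=np.zeros_like(counts), where=class_sizes > 0)
    np.fill_diagonal(rates, 0.0)
    return rates


def operator_norm(C):
    """Operator (spectral) norm: the largest singular value of C."""
    return np.linalg.norm(C, ord=2)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
C = confusion_matrix(y_true, y_pred, n_classes=3)
print(operator_norm(C))
```

With this convention, a classifier that makes no mistakes has ||C|| = 0, and the norm grows as errors concentrate in any class.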
-
Stochastic Low-Rank Kernel Learning for Regression
Authors:
Pierre Machart,
Thomas Peel,
Liva Ralaivola,
Sandrine Anthoine,
Hervé Glotin
Abstract:
We present a novel approach to learn a kernel-based regression function. It is based on the use of conical combinations of data-based parameterized kernels and on a new stochastic convex optimization procedure for which we establish convergence guarantees. The overall learning procedure has the nice properties that a) the learned conical combination is automatically designed to perform the regression task at hand and b) the updates involved in the optimization procedure are quite inexpensive. In order to shed light on the appositeness of our learning strategy, we present empirical results from experiments conducted on various benchmark datasets.
Submitted 11 January, 2012;
originally announced January 2012.
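Illustration (not from the paper): a conical combination of kernels is a weighted sum with non-negative weights, which keeps the combined Gram matrix positive semi-definite. The sketch below uses Gaussian kernels with different bandwidths as stand-ins; the paper's own data-based parameterized kernels and its stochastic optimization procedure are not reproduced here, and all names are hypothetical.

```python
import numpy as np


def gaussian_gram(X, bandwidth):
    """Gram matrix of a Gaussian kernel (a stand-in base kernel)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def conical_combination(grams, weights):
    """Weighted sum of Gram matrices with non-negative weights."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("conical combinations require non-negative weights")
    return sum(w * K for w, K in zip(weights, grams))


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
grams = [gaussian_gram(X, b) for b in (0.5, 1.0, 2.0)]
K = conical_combination(grams, weights=[0.2, 0.0, 1.5])

# The combined matrix is symmetric and (numerically) positive semi-definite.
print(np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-9)
```

In a learning setting, the weights would be fitted to the regression task rather than fixed by hand as they are here.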
-
Chromatic PAC-Bayes Bounds for Non-IID Data: Applications to Ranking and Stationary $β$-Mixing Processes
Authors:
Liva Ralaivola,
Marie Szafranski,
Guillaume Stempfel
Abstract:
PAC-Bayes bounds are among the most accurate generalization bounds for classifiers learned from independently and identically distributed (IID) data, particularly so for margin classifiers: there have been recent contributions showing how practical these bounds can be either to perform model selection (Ambroladze et al., 2007) or even to directly guide the learning of linear classifiers (Germain et al., 2009). However, there are many practical situations where the training data show some dependencies and where the traditional IID assumption does not hold. Stating generalization bounds for such frameworks is therefore of the utmost interest, both from theoretical and practical standpoints. In this work, we propose the first, to the best of our knowledge, PAC-Bayes generalization bounds for classifiers trained on data exhibiting interdependencies. The approach undertaken to establish our results is based on the decomposition of a so-called dependency graph, which encodes the dependencies within the data, into sets of independent data, thanks to graph fractional covers. Our bounds are very general, since being able to find an upper bound on the fractional chromatic number of the dependency graph is sufficient to get new PAC-Bayes bounds for specific settings. We show how our results can be used to derive bounds for ranking statistics (such as AUC) and classifiers trained on data distributed according to a stationary $β$-mixing process. Along the way, we show how our approach seamlessly allows us to deal with U-processes. As a side note, we also provide a PAC-Bayes generalization bound for classifiers learned on data from stationary $\varphi$-mixing distributions.
Submitted 4 June, 2010; v1 submitted 10 September, 2009;
originally announced September 2009.
-
The pharmacophore kernel for virtual screening with support vector machines
Authors:
Pierre Mahé,
Liva Ralaivola,
Véronique Stoven,
Jean-Philippe Vert
Abstract:
We introduce a family of positive definite kernels specifically optimized for the manipulation of 3D structures of molecules with kernel methods. The kernels are based on the comparison of the three-point pharmacophores present in the 3D structures of molecules, a set of molecular features known to be particularly relevant for virtual screening applications. We present a computationally demanding exact implementation of these kernels, as well as fast approximations related to the classical fingerprint-based approaches. Experimental results suggest that this new approach outperforms state-of-the-art algorithms based on the 2D structure of molecules for the detection of inhibitors of several drug targets.
Submitted 3 March, 2006;
originally announced March 2006.